CHAPTER 1


Introduction 


This document constitutes the final report for the Very Large Parallel Data Flow
(VLPDF) program. This introduction contains the VLPDF program background, program
overview, our overall approach, and an outline of the rest of this report, including
the key results contained in each chapter.

1.1.  Background 

Battle management involves managing large volumes of real-time data, interpreting
this data, correlating data from multiple sources into a multidimensional view of the
world, predicting enemy actions, monitoring incoming information for enemy threats
and inconsistencies with predictions, and planning effective countermeasures. The large
volumes of incoming data and the short response times required will force computers to
take over many of the analysis and decision-making functions currently performed by
humans. This implies the use of knowledge-based techniques to implement these
sophisticated functions. Battle management systems will therefore be required to manage
large data/knowledge bases.

What  does  managing  a  large  data/knowledge  base  mean?  What  are  the  functions 
of  a  Data/Knowledge  Base  Management  System  (D/KBMS)?  These  questions  are  being 
investigated  by  several  research  groups.  However,  Brodie’s  definition  seems  to  capture 
the  essence  of  a  knowledge  base  management  system.  He  defines  a  knowledge  base 
management  system  as  "a  system  providing  highly  efficient  management  of  large,  shared 
knowledge  bases  for  knowledge-directed  applications"  [Brod86].  While  there  is  ongoing 
debate  about  the  functionality  of  a  D/KBMS,  this  and  other  definitions  imply  that,  at  a 
very  minimum,  a  D/KBMS  must  provide  a  set  of  facilities  analogous  to  the  data 
definition,  data  manipulation,  data  access,  and  data  integrity  facilities  provided  by  a 
database  management  system  (DBMS). 

A  D/KBMS  is  a  combination  of  two  different  search  engines  —  an  inferential  search 
engine  and  a  query  evaluation  search  engine.  The  key  technical  challenge  in  designing  a 
D/KBMS  is  performance,  since  an  inappropriate  combination  of  these  two  search 
engines  can  lead  to  very  poor  response  times.  The  performance  issue  is  particularly 
significant  in  a  battle  management  environment,  due  to  the  large  size  of  the 
data/knowledge  base  and  the  short  response  times  required.  A  D/KBMS  must  deliver 
very  high  performance,  to  be  effective  in  such  an  environment. 

The objective of the VLPDF program was to investigate the use of parallel processing
in very large data/knowledge base management as a means of attaining the required
performance levels. The program was motivated by the following observations: 1) special
purpose parallel architectures have been shown to provide high performance for large
database applications, and 2) database management is the minimum functionality a
D/KBMS must provide. Parallel processing techniques have been used in database
management to provide substantially higher performance than mainframe-based systems
for large database applications. For example, Thinking Machines Corporation’s 64,000
processor Connection Machine performs five times faster than a Cray mainframe
computer in searching a text database of 10 billion characters. Teradata Corporation’s
DBC/1012, a parallel relational database machine, is another high performance engine
for large database applications. The DBC/1012 can be scaled up to 1024 processors,
each with a Gigabyte of disk, for a total database size of a Terabyte. Both machines
embody special purpose parallel architectures: the Connection Machine, a special
purpose architecture optimized for operations on large, complex data structures, and the
DBC/1012, a special purpose architecture optimized for relational operations. The
emphasis of the VLPDF program was on investigating the use of such special purpose
parallel architectures for very large data/knowledge base management.

1.2.  Program  Overview 

The  VLPDF  program  spanned  24  months  and  was  divided  into  three  phases. 

1.2.1.  Phase  I 

This phase spanned 9 months and involved investigation of:

• parallel processing techniques for inference processing,

•  parallel  processing  techniques  for  very  large  database  management,  and 

•  fault  tolerance  techniques  for  very  large  databases. 

Both  inference  processing  and  database  management  were  included  because  a 
D/KBMS  is  a  combination  of  an  inferential  search  engine  and  a  query  evaluation  search 
engine.  The  investigation  included  algorithms  and  architectures  for  relational  database 
machines,  and  different  forms  of  parallelism  present  in  inference  processing  —  AND, 
OR,  stream,  search,  etc. 

Fault tolerance was included because a D/KBMS must be highly available to be
effective in a battle management environment. The fault tolerance investigation was to
emphasize high data availability techniques for parallel data/knowledge base
management systems, where the larger number of components involved may render the system
more susceptible to failure.

1.2.2.  Phase  II 

This  phase  also  spanned  9  months  and  involved  development  of: 

•  a  methodology  for  specifying  various  architecture  approaches  for  large 
data/knowledge  base  management  systems,  and 

•  a  set  of  guidelines  for  choosing  among  them. 

1.2.3. Phase III

This  phase  spanned  6  months  and  involved: 

• development of a test plan for evaluating candidate D/KBMS architecture
approaches, and

• demonstrating capability to search and update a very large data/knowledge base.

1.3. Overall Approach

One issue that confronted us early in Phase I was the choice of a broad approach
to integrating Artificial Intelligence (AI) and database technologies to realize a
D/KBMS. There are basically four such approaches:

•  Loose  coupling.  Couple  an  existing  AI  system  (Prolog,  Lisp,  existing  expert  system 
shells,  etc.)  with  an  existing  DBMS. 

• DBMS extension. Extend the data model of a DBMS with knowledge representation
and inference capabilities.

• AI system extension. Include database functionality in an AI system.

•  Tight  coupling.  Combine  the  AI  and  DBMS  concepts  of  knowledge  representation 
and  data  modeling,  i.e.,  integrate  at  the  so-called  knowledge  level. 

Keeping in mind the 9-month duration of Phase I, we felt that it was better to choose
one broad approach, after carefully considering all four, rather than investigating all
four approaches to the same level of detail. Choosing an approach, we felt, would
enhance the chances of getting concrete results for the program. The issue confronting
us was: which of these four broad approaches is best suited to exploiting the capabilities
of a parallel search engine?

In the loose coupling approach, the DBMS is used just as a query evaluation search
engine, with all inferential search being done by the AI system. The loose coupling
approach, therefore, does not exploit the capabilities provided by parallel DBMS
architectures for inference processing, and so will most likely result in a D/KBMS that
performs poorly.

The tight coupling approach is the most promising one. However, significant
technical challenges must be overcome before a D/KBMS with knowledge level
integration becomes feasible. Several research efforts are ongoing that address these challenges.
We believe that it is too early to tell how parallel search engines can be exploited in the
tight coupling approach.

Of the remaining two approaches, we felt that the DBMS extension approach
was the more promising one for exploiting parallel search engines. Parallel relational
database machines, such as the DBC/1012, provide very high performance for large,
shared database applications. Admittedly, the Connection Machine, a so-called AI
machine, has also been shown to provide high performance in searching large databases.
However, its effectiveness for large, shared database applications has not yet been
established. This may change in the future. But for our work in the VLPDF program, we
shunned the AI system extension approach and, instead, chose the DBMS extension
approach.

In the DBMS extension approach, a D/KBMS is viewed as a functional extension of
a DBMS. A database contains only data in extensional format, or facts. A
data/knowledge base, in addition, contains knowledge in intensional format. The
intensional part of the knowledge base is also called the rule base. The query language of a
DBMS can manipulate the extensional data stored in the DBMS. The query language
for a D/KBMS, in addition, has a deductive inference mechanism, which interprets the
rules, combining them with the extensional data to infer new facts not explicitly stored.
A DBMS uses query compilation and optimization, along with special purpose parallel
architectures, to attain high performance for large database applications. Likewise, our
overall approach to obtaining high performance in large data/knowledge base
applications was:

Compile  queries  to  the  intensional  knowledge  base,  using  appropriate 
optimizations,  into  a  program  that  executes  against  the  extensional 
data.  For  queries  to  the  extensional  knowledge  base,  use  DBMS 
query  compilation  and  optimization  techniques.  In  both  cases,  use  a 
parallel  database  machine  to  execute  the  compiled  program. 

It was in this manner that we exploited special purpose parallel search engines for doing
deductive inference, or knowledge-directed retrieval. The broad approach of extending
the functionality of a DBMS to realize a D/KBMS proved to be very well suited to
leveraging the vast amount of research that has gone into parallel database machine
architectures.
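
As a rough illustration of the idea of compiling an intensional rule into a program that runs against extensional data, the following Python sketch evaluates one non-recursive rule as a join followed by a projection. The `parent` relation, the `grandparent` rule, and all names are hypothetical examples chosen for illustration; they do not come from the VLPDF work, and a real D/KBMS compiler would emit relational algebra for a database machine rather than Python.

```python
# Extensional data: facts stored as a relation (a set of tuples).
# Hypothetical example relation, not taken from the report.
parent = {("ann", "bob"), ("bob", "cal"), ("bob", "dee")}


def grandparent(parent_rel):
    """Evaluate the rule  grandparent(X, Z) :- parent(X, Y), parent(Y, Z).

    The rule body becomes a join of parent with itself on Y, followed by
    a projection onto (X, Z), i.e. a small relational program executed
    against the extensional data.
    """
    return {
        (x, z)
        for (x, y1) in parent_rel
        for (y2, z) in parent_rel
        if y1 == y2
    }


# Facts inferred from the rule; none of them is stored explicitly.
derived = grandparent(parent)
```

The derived tuples exist only as the result of evaluating the rule, which is the sense in which a data/knowledge base query can return facts that are not explicitly stored.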

We adopted the logic programming paradigm for providing the functional
extension to a DBMS. Several research projects have adopted this paradigm, including the
Advanced  Database  System  project  at  MCC  and  the  NAIL!  project  at  Stanford.  Logic 
offers  several  advantages: 

•  it  provides  a  uniform  formalism  for  data,  rules,  views,  and  integrity  constraints; 

•  it  is  the  basis  for  relational  database  theory; 

•  it  is  amenable  to  parallel  processing; 

•  it  is  an  adequate  basis  for  implementing  other  knowledge  representations;  and 

•  it  has  a  sound  theoretical  foundation,  which  permits  the  abstract  expression  of 
ideas,  independent  of  their  implementation. 

1.4.  Report  Summary 

This  section  gives  a  brief  description  of  the  chapters  in  the  rest  of  this  report, 
including  the  key  results  contained  in  each  chapter. 

Chapters 2 through 7 describe the results of the various investigation studies we
performed under the VLPDF contract. We performed a total of six studies: (1)
alternative D/KBMS application interface languages, (2) parallel architectures for such
languages, (3) data/knowledge query processing, (4) transitive closure algorithms, (5)
parallel database management system architectures, and (6) fault tolerance in very large
database systems.

Chapter  2  describes  our  investigation  of  alternatives  for  a  D/KBMS  application 
interface  language.  A  D/KBMS  will  be  required  to  support  intelligent  applications  such 
as  planning,  monitoring,  interpretation,  diagnosis,  and  prediction.  These  applications 
will typically be expressed in an expert-system-shell-like language in which the D/KBMS
query language is embedded. This is exactly analogous to data processing applications
expressed in Cobol with embedded SQL. The motivation for studying D/KBMS application
interface language alternatives is that the overall performance depends not only
on  the  D/KBMS  performance  but  also  on  the  execution  efficiency  of  the  application 
interface  language  on  a  hardware  architecture. 

We  identified  three  principal  requirements  for  the  D/KBMS  application  interface 
language:  it  must  be  amenable  to  large  scale  parallelism,  it  must  be  a  suitable  base  for 
implementing an expert system shell, and it must support efficient non-procedural
database access. We considered three language classes: imperative, functional, and logic.
We  found  that  no  existing  language  is  uniformly  superior  to  all  others  with  respect  to 
our  requirements.  However,  we  believe  that  PARLOG,  a  parallel  logic-based  language, 
represents  the  best  choice  among  existing  languages. 

Chapter 3 presents the results of our investigation of parallel architectures for
executing PARLOG. The form of parallelism present in PARLOG is called stream-AND
parallelism, where several processes work concurrently on constructing the solution to a
logic-based query. We designed a parallel abstract machine for executing PARLOG,
which lays bare all the functions needed to execute PARLOG programs as well as the
data objects created and manipulated by these functions. We also developed a
simulator for the abstract machine, which serves as a tool for collecting data on the
execution behavior of PARLOG and analyzing this data.

The principal conclusion from this work is that shared memory greatly facilitates
implementing PARLOG’s stream-AND parallelism and that the key to high-performance
stream-AND parallelism is an efficient shared memory abstraction on a loosely
coupled architecture. The abstract machine we describe achieves this via a number of
optimizations. These optimizations address critical problems in the design of efficient
parallel architectures, namely the principal sources of overhead: communication
and memory latencies, and synchronization overheads.

We also investigated the feasibility of executing PARLOG programs on the
Connection Machine architecture. The conclusion from this work is that a coarse grained,
loosely  coupled  architecture  is  better  suited  than  the  Connection  Machine,  since  the 
form  of  parallelism  best  supported  by  the  Connection  Machine  is  directly  opposite  to 
that  found  in  PARLOG. 

Chapter  4  presents  the  background  concepts  pertaining  to  data/knowledge  base 
query  processing.  The  two  main  concepts  covered  are  least  fixed  point  evaluation  and 
data/knowledge  base  query  optimization. 

One of the difficult problems in the design of a D/KBMS is how to evaluate
recursive queries efficiently. Among the large family of recursive queries, transitive
closure queries form a very important subset. They are important because (1) a large
number of recursive queries can be expressed using transitive closures, (2) most
application problems involving recursive queries that we can see now are actually
transitive closure queries, and (3) efficient processing of transitive closure queries will
provide a sound base for solving more complicated recursive queries. We evaluated several
algorithms for computing the transitive closure of a database relation. The results of this
evaluation are presented in chapter 5. Based on this investigation, we concluded that it
is possible to further optimize transitive closure processing. This led us to develop new
strategies for this problem, which we then present.
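
To make the transitive closure query concrete, here is a minimal Python sketch of the textbook semi-naive iteration, in which each round joins only the newly derived tuples with the base relation until nothing new appears. It is offered as background only: it is not one of the algorithms evaluated in chapter 5, nor one of the new strategies developed there, and the function name and sample data are assumptions made for illustration.

```python
def transitive_closure(edges):
    """Compute the transitive closure of a binary relation.

    Implements the recursive rule
        path(X, Z) :- edge(X, Z).
        path(X, Z) :- path(X, Y), edge(Y, Z).
    by semi-naive iteration: only tuples derived in the previous round
    (delta) are joined with the base relation in the next round.
    """
    closure = set(edges)
    delta = set(edges)
    while delta:
        # Join the newest paths with the base relation on the shared node.
        candidates = {
            (x, z)
            for (x, y) in delta
            for (y2, z) in edges
            if y == y2
        }
        # Keep only tuples not seen before; an empty delta means the
        # least fixed point has been reached.
        delta = candidates - closure
        closure |= delta
    return closure


# Hypothetical sample relation: a chain a -> b -> c -> d.
reachable = transitive_closure({("a", "b"), ("b", "c"), ("c", "d")})
```

Because each round only extends paths found in the previous round, the loop terminates on cyclic relations as well, once no new pairs can be added.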

Chapter 6 presents the results of our investigation of parallel architectures for
database management. Our work in this area has focused on parallel
algorithms for the join operation. The join is an important operation in
relational database systems, and will become even more important as logic-based
inference capabilities are added to these systems. We describe a number of multiprocessor
join algorithms. The algorithms use sort-merge and hashing techniques, and are highly
parallel and pipelined. They are designed to execute on a multiprocessor
architecture that is parameterized in the degree of memory sharing, so that tightly
coupled, loosely coupled, and intermediate architectures can be modeled. Other
architectural parameters include the number of processors, number of disks, amount of main
memory, and interconnection network bandwidth. We model the performance of the
algorithms analytically to determine elapsed time, resource utilization, and other
quantities as functions of the workload and architectural parameters. The join algorithms
overlap computation, disk transfers, and interconnection network transfers. The
analysis models this overlap and identifies bottlenecks that limit the algorithms’
performance. We do not model multiple simultaneous join operations, and therefore do not
compute system throughput. Based on this analysis, we answer the following questions:
(1) How do the algorithms compare in performance? When does one outperform
another? (2) How does response time vary as a function of the architectural
parameters? (3) How does response time vary with the workload? (4) Does shared memory
help algorithm performance? To what extent? (5) What are the architectural
bottlenecks? How could they be alleviated? Based on the above study, we draw several
conclusions regarding parallel processing of the join operation.
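
The general shape of a hash-based multiprocessor join can be sketched as follows: both relations are partitioned on the join attribute, and each partition pair is then joined independently with a local build and probe. This Python sketch is only an assumption-laden illustration and is not one of the chapter 6 algorithms. The function names are invented, threads stand in for processors, and pipelining, disk transfers, and the interconnection network are ignored.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(relation, key_index, n_parts):
    """Split a relation into n_parts buckets by hashing the join attribute."""
    parts = [[] for _ in range(n_parts)]
    for row in relation:
        parts[hash(row[key_index]) % n_parts].append(row)
    return parts


def local_hash_join(r_part, s_part, r_key, s_key):
    """Join one pair of partitions: build a hash table on R, probe with S."""
    table = defaultdict(list)
    for r in r_part:
        table[r[r_key]].append(r)
    return [r + s for s in s_part for r in table.get(s[s_key], [])]


def parallel_hash_join(r, s, r_key, s_key, n_parts=4):
    """Equi-join R and S; each partition pair is handled by its own worker.

    Matching tuples always hash to the same partition number, so the
    workers never need to exchange data after the partitioning step.
    """
    r_parts = partition(r, r_key, n_parts)
    s_parts = partition(s, s_key, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        results = pool.map(
            local_hash_join,
            r_parts,
            s_parts,
            [r_key] * n_parts,
            [s_key] * n_parts,
        )
        return [row for part in results for row in part]
```

For example, `parallel_hash_join(employees, departments, 1, 0)` would join two hypothetical relations on the second attribute of the first and the first attribute of the second. The order of the result rows depends on the partitioning and should not be relied on.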

Chapter 7 presents the results of our investigation of fault tolerance in very large
database systems. A very large database is usually heavily used, and many users and
applications depend on it. Downtime or unavailability of such a system is expensive
and affects critical applications dependent on it. We studied the effect of fault tolerance
techniques and system design on system availability. Specifically, we attempted to
answer the following questions: What are the main parameters that affect fault
tolerance of a very large database system? How do you evaluate their effect on fault
tolerance? How important are various fault tolerance techniques? What are the trade-offs
that should be considered when designing a very large database system with a desired
degree of fault tolerance? A generic multiprocessor architecture is used that can be
configured in different ways to study the effect of system architectures. Important
parameters studied are different system architectures and hardware fault tolerance
techniques, mean time to failure of basic components, database size and distribution,
interconnect capacity, etc. Quantitative analysis compares the relative effect of different
parameter values. Results show that the effect of different parameter values on system
availability can be very significant. System architecture, use of hardware fault tolerance
(particularly mirroring), and data storage methods emerge as very important parameters
under the control of a system designer.
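
A back-of-the-envelope calculation shows why mirroring can matter so much when a system has many components. The Python sketch below uses the standard steady-state formulas and assumes independent failures: availability is MTTF / (MTTF + MTTR), a chain of required components multiplies availabilities, and a mirrored set is unavailable only when every copy is down. This is not the quantitative model of chapter 7. The mean time to repair, the component counts, and all numeric values are hypothetical.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability of one component."""
    return mttf_hours / (mttf_hours + mttr_hours)


def series(*components):
    """All components are required: availabilities multiply."""
    result = 1.0
    for a in components:
        result *= a
    return result


def mirrored(a, copies=2):
    """Mirrored copies: the data is unavailable only if every copy is down."""
    return 1.0 - (1.0 - a) ** copies


# Hypothetical figures: 100 disks, each with a 10,000 hour MTTF and a
# 10 hour MTTR, all of which must be reachable for the database to be up.
disk = availability(10_000, 10)
plain = series(*[disk] * 100)
with_mirroring = series(*[mirrored(disk)] * 100)
```

Under these assumptions the unmirrored configuration loses availability roughly in proportion to the number of disks, while the mirrored one stays close to 1, which is consistent with the observation that a larger number of components makes a parallel system more susceptible to failure.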

During Phase II, the results of the six investigation studies were combined to
develop a methodology for specifying high performance, highly available D/KBMS
architectures for very large data/knowledge base (D/KB) environments. Chapter 8
describes this methodology, which is presented as a set of policies and steps. The
policies are meant to serve as a guide to the D/KBMS designer in making appropriate
decisions on the following critical D/KBMS design issues: overall D/KBMS functionality,
knowledge representation, rule storage, D/KB query processing, D/KB update
processing, D/KBMS functional partitioning, least fixed point evaluation, join processing,
D/KBMS hardware architecture, and fault tolerance. The steps are presented as a
recipe for D/KB query and update processing.

Briefly,  using  this  methodology,  we  would  design  a  high  performance  D/KBMS  by 
first designing a parallel relational database machine that employs the parallel and
pipelined join algorithms described in chapter 6. Next, we would design parallel algorithms
for  LFP  evaluation  using  the  above  join  strategies,  data  flow  and  pipelining  techniques, 
and  semi-naive  evaluation.  Finally,  we  would  design  a  compiler  that  compiles  Horn 
clause  queries  to  relational  algebra  augmented  with  a  general  least  fixed  point  operator 
and  that  uses  the  generalized  magic  sets  strategy  for  restricting  the  search  space  to  the 
relevant  base  relation  tuples. 

Chapter 9 describes a data/knowledge base management testbed that we designed
and implemented on top of a commercial relational database system. The testbed is
intended to serve as both a demonstration platform and a performance measurement and
evaluation platform. As a demonstration platform, the testbed illustrates the motivation and
basic functionality of a D/KBMS, the components of a D/KBMS architecture,
alternative implementations of these components and their relative tradeoffs, and the factors
contributing to D/KB query compilation and execution time. As a performance
measurement and evaluation platform, the testbed allows us to make quantitative
performance measurements and to study system performance sensitivity and behavior with
respect to several parameters.

Chapter  10  describes  the  VLPDF  demonstration  plan.  The  demonstration  consists 
of  three  experiments  designed  to  demonstrate  the  motivation  and  functionality  of  a 
D/KBMS  and  the  components  of  a  D/KBMS  architecture. 

Chapter 11 describes several experiments designed to quantitatively measure
D/KBMS performance and to understand D/KBMS performance sensitivity and
behavior with respect to various system parameters. The basic motivation for doing
these experiments is to justify the D/KBMS architecture specification methodology
described in chapter 8, that is, to show that this methodology can indeed be used to
design high performance D/KBMSs. The chapter describes D/KBMS performance
measures, system parameters affecting these measures, test results, analysis of these
results, and the conclusions drawn from them.

Chapter  12  summarizes  the  key  conclusions  from  this  work  and  indicates  several 
directions  for  future  work. 


CHAPTER  2 


Application  Interface 


The  choice  of  an  application  interface  language  for  a  D/KBMS  is  a  difficult  one. 
No existing language is uniformly superior to all others with respect to our
requirements. We have chosen PARLOG, a parallel logic-based language developed by Clark
and  Gregory  [Clar86].  We  believe  that  PARLOG  represents  the  best  choice  among 
existing  languages.  Its  principal  weaknesses  are  limited  data  structuring  capabilities 
and  possible  inefficiencies  in  implementing  object-oriented  knowledge  representation. 
However,  it  is  superior  to  other  languages  in  many  respects.  In  particular,  it  permits  a 
high  degree  of  parallelism  in  both  procedural  computation  and  database  access. 

The chapter is organized as follows. Section 2.1 lists our requirements for an
application interface language. Section 2.2 discusses the three language classes considered:
imperative, functional, and logic, and justifies our preference for logic-based languages.
Section 2.3 compares alternative logic-based languages and justifies our choice of
PARLOG. Section 2.4 reviews PARLOG with respect to the original requirements. Section
2.5 describes PARLOG in some detail.

2.1.  Language  Requirements 

We  have  identified  three  major  requirements  for  the  application  interface  language. 
The  language  must 

•  be  amenable  to  large-scale  parallelism, 

•  be  a  suitable  base  for  implementing  an  expert  system  shell,  and 

•  support  efficient  nonprocedural  database  access. 

The  following  sections  explain  these  requirements. 

2.1.1.  Large-Scale  Parallelism 

The  language  must  be  amenable  to  parallel  execution.  Here,  we  are  looking  for 
large-scale  parallelism,  where  the  degree  of  parallelism  possible  is  proportional  to  the 
volume of data being processed. Such parallelism is currently exploited in multiprocessor
relational database machines such as the Teradata DBC/1012 [Tera83]. A language
that  permits  only  small-scale  parallelism,  e.g.  pipelined  execution,  is  of  less  interest.  We 
believe  that  large-scale  parallelism  is  essential  to  future  C3I  applications.  It  must  be 
possible  to  scale  the  D/KBMS  architecture  to  the  size  of  the  problem  being  tackled; 
applications  written  in  the  language  should  not  require  modification  to  take  advantage 
of  an  expanded  D/KBMS  system  configuration. 

2.1.2. Base for Expert System Shell

Advanced C3I information management applications must perform many of the
knowledge-based activities associated with expert systems: interpretation, diagnosis,
monitoring, prediction, and planning [Stef82]. Therefore we expect that these
applications will require the full range of facilities provided by expert system shells such as
KEE, ART, and LOOPS. These facilities include:

• forward chaining (data-driven reasoning)

• backward chaining (goal-driven reasoning)

• procedural computation

• object-oriented knowledge representation

• evidential reasoning

• models of time and hypothetical worlds

• belief maintenance

• nonmonotonic reasoning

• explanation facilities

Forward chaining is a rule-based inference mechanism in which the rules are used
in a forward direction, from prerequisites to conclusion or action, to derive new
information or a new state from existing information or the current state. OPS5 [Forg81] is
an example of a forward chaining language. Forward chaining is also called data-driven
reasoning because it accepts new information and derives all consequences of this
information. In the C3I domain, forward chaining is useful in applications that monitor the
current state of the world for significant events that may require a response.
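
The data-driven character of forward chaining can be seen in a very small propositional sketch: starting from the known facts, rules whose prerequisites are all satisfied fire repeatedly until no new conclusion can be added. The Python below is an illustrative assumption, not OPS5 and not part of the report. Production systems such as OPS5 match patterns with variables and use conflict resolution, both of which are omitted here, and the rule and fact names are hypothetical.

```python
def forward_chain(facts, rules):
    """Derive every consequence of the given facts.

    rules is a list of (prerequisites, conclusion) pairs, where
    prerequisites is a frozenset of propositions that must all hold.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for prerequisites, conclusion in rules:
            # Fire the rule from prerequisites to conclusion.
            if prerequisites <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical monitoring rules, for illustration only.
RULES = [
    (frozenset({"radar_contact", "no_iff_response"}), "unidentified_aircraft"),
    (frozenset({"unidentified_aircraft", "inbound"}), "possible_threat"),
]

derived = forward_chain({"radar_contact", "no_iff_response", "inbound"}, RULES)
```

Each new fact entering the system can trigger further rules, which is why this style suits applications that watch a changing state for significant events.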

Backward  chaining  is  a  rule-based  mechanism  in  which  rules  are  used  backwards, 
from  conclusion  or  action  to  prerequisites,  to  determine  whether  a  given  statement  or 
goal can be supported by existing information. Backward chaining is also called
goal-driven reasoning. Prolog [Cloc84] is the best-known backward chaining language.
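
For contrast with the data-driven style, a goal-driven evaluator starts from the statement to be established and works back through the rules to the stored facts. The following Python sketch is a propositional simplification made for illustration. Prolog additionally handles variables through unification and explores alternatives by depth-first search with backtracking, none of which is modeled here, and the rule and fact names are hypothetical.

```python
def backward_chain(goal, facts, rules, _pending=frozenset()):
    """Return True if goal can be supported by facts and rules.

    rules is a list of (prerequisites, conclusion) pairs. _pending holds
    the goals currently being proved, so that circular rules cannot
    cause infinite recursion.
    """
    if goal in facts:
        return True
    if goal in _pending:
        return False
    for prerequisites, conclusion in rules:
        # Use the rule backwards: from its conclusion to its prerequisites.
        if conclusion == goal and all(
            backward_chain(p, facts, rules, _pending | {goal})
            for p in prerequisites
        ):
            return True
    return False


# Hypothetical rules, for illustration only.
RULES = [
    (frozenset({"radar_contact", "no_iff_response"}), "unidentified_aircraft"),
    (frozenset({"unidentified_aircraft", "inbound"}), "possible_threat"),
]

supported = backward_chain(
    "possible_threat",
    {"radar_contact", "no_iff_response", "inbound"},
    RULES,
)
```

Only the rules relevant to the goal are examined, whereas forward chaining derives every consequence whether or not it is needed.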

Procedural  computation  is  prescriptive  computation  of  results  from  arguments,  as 
found  in  languages  such  as  Pascal  and  LISP. 

Object-oriented  knowledge  representation  models  real  world  objects  and  concepts  as 
objects  in  the  Smalltalk  sense,  i.e.,  combinations  of  data  and  procedures  to  operate  on 
the data. Frames [Barr81] are a form of object-oriented knowledge representation. An
object’s  data  can  be  accessed  only  via  its  procedures,  which  are  invoked  by  "sending  the 
object  a  message."  Different  objects  can  respond  to  the  same  message  in  their  own  way, 
according  to  their  associated  procedures.  Objects  can  be  combined  into  larger  objects 
in  an  aggregation  hierarchy.  An  inheritance  hierarchy  can  also  be  defined  among 
objects,  so  that  an  object  can  inherit  by  default  the  properties  (procedures  or  data)  of 
an object higher in the hierarchy. An object-oriented knowledge representation
provides a concise way of modeling many real world situations.

Evidential  reasoning  is  a  form  of  reasoning  in  which  hypotheses  have  associated 
probabilities  or  certainty  factors.  Mycin  [Shor76],  an  expert  system  for  diagnosing  and 
treating  infectious  diseases,  employs  probabilistic  reasoning.  Evidential  reasoning  can 
be applied to diagnostic and planning tasks in the C3I domain as well.
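
One widely cited way of working with certainty factors is the combination function used in MYCIN-style (EMYCIN) systems, which merges two certainty factors in the range -1 to 1 that bear on the same hypothesis. The Python sketch below implements that function as it is commonly stated in the expert systems literature. It is included as background, is not drawn from this report, and the function name is an assumption.

```python
def combine_cf(cf1, cf2):
    """Combine two certainty factors for the same hypothesis.

    Both inputs lie in [-1, 1]: positive values support the hypothesis,
    negative values count against it.
    """
    if cf1 >= 0 and cf2 >= 0:
        # Two pieces of supporting evidence reinforce each other.
        return cf1 + cf2 * (1 - cf1)
    if cf1 < 0 and cf2 < 0:
        # Two pieces of opposing evidence reinforce each other.
        return cf1 + cf2 * (1 + cf1)
    # Conflicting evidence partially cancels.
    denominator = 1 - min(abs(cf1), abs(cf2))
    if denominator == 0:
        raise ValueError("cannot combine certain support with certain denial")
    return (cf1 + cf2) / denominator
```

Combining supporting evidence never pushes the result past 1, and the function is symmetric in its arguments, so the order in which evidence arrives does not change the outcome.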


Models of time and hypothetical worlds refer to the ability to model not just the
current state of the world, but the sequence of states that constitutes the past, and
possibly one or more hypothetical sequences of future states. Each state in the sequence is
an incremental modification of the previous one, typically representing the addition of a
new fact or assumption. Hypothetical worlds can be used to explore alternative
strategies in a planning activity.

Belief  maintenance  is  a  facility  that  can  be  built  into  a  logic-based  inference  system 
to record dependencies among propositions, detect inconsistencies among them, and
permit retraction of propositions.

Nonmonotonic  reasoning  is  a  reasoning  system  in  which  a  default  assumption  about 
an object or situation can be made in the absence of specific information. This
assumption can later be overridden when additional information becomes available. Retraction
of  the  assumption  requires  retraction  of  any  conclusions  deduced  from  the  assumption, 
using  some  form  of  belief  maintenance. 

An  explanation  facility  is  a  mechanism  by  which  an  inference  system  displays  how 
it  proved  or  failed  to  prove  a  particular  conclusion,  tracing  the  conclusion  back  via 
inference  rules  to  base  facts. 

The application interface language need not support these facilities directly. However,
it should be possible to implement them reasonably efficiently in the language.

2.1.3. Support for Nonprocedural Database Access

Most  expert  systems  today  use  a  relatively  small  amount  of  data  that  can  be  stored 
entirely  in  main  memory.  Advanced  C3I  applications  will  have  large  data  and 
knowledge  bases,  necessitating  an  underlying  database  management  system.  Therefore, 
the  language  must  support  nonprocedural  retrieval  and  update  of  stored  data,  as  in  a 
relational  DBMS,  and  must  permit  efficient  execution  of  traditional  relational  queries. 

2.2.  Alternative  Language  Classes 

The candidate languages can be divided into three major classes: imperative, functional,
and logic. The imperative languages are characterized by sequences of commands
or statements that make incremental changes to a global program state contained
in a set of variables. Examples of imperative languages are traditional programming
languages such as FORTRAN and Pascal, and more modern object-based
languages such as Smalltalk and CLU.

Programs  in  functional  languages  are  essentially  definitions  and  applications  of 
functions.  There  is  no  notion  of  operations  on  named  objects,  and  therefore  there  are 
no  side  effects.  Examples  of  functional  languages  include  pure  LISP,  FP,  and  dataflow 
languages  such  as  Val  and  Id.  LISP,  as  it  is  generally  used  today  in  expert  systems  and 
other  applications,  is  an  imperative  rather  than  a  functional  language;  great  use  is  made 
of  side  effects. 

Programs  in  most  logic  programming  languages  are  composed  of  Horn  clauses, 
which have the form

head :- body.

where head is zero or one atomic formulas (predicates with arguments supplied) and
body is a conjunction of zero or more atomic formulas. All arguments that are variables
are implicitly universally quantified. The logical interpretation of a Horn clause is
that the body implies the head. For example, the Horn clause

a(X,Y) :- b(X,Z), c(Z,Y).

means that for all X, Y, and Z, b(X,Z) and c(Z,Y) imply a(X,Y). (We use the Prolog
convention that constants and predicate names begin with lower case letters, while
variables begin with capital letters.)

An  empty  Horn  clause  body  is  considered  true.  Therefore  a  Horn  clause  with  an 
empty  body  states  that  the  head  is  always  true.  Such  a  Horn  clause  is  called  a  fact. 
Facts  can  be  written  with  no  implication  sign,  e.g., 

parent(a,b).

means that a is the parent of b.

An  empty  Horn  clause  head  is  considered  false.  Therefore  a  Horn  clause  with  an 
empty  head  states  that  the  conjunction  of  atomic  formulas  in  the  clause’s  body  is  false. 
This  refutation  of  the  body  can  be  used  to  initiate  a  resolution-based  proof  that  the 
body  is  in  fact  true  [Cloc84].  In  the  course  of  this  proof,  all  variable  instantiations  that 
make  the  body  true  can  be  discovered.  A  Horn  clause  with  no  head  is  therefore  called  a 
goal  or  query. 

Horn  clauses  do  have  limited  expressiveness:  it  is  difficult  to  express  indefinite 
information such as "object X is either an airplane or a missile." However, permitting
the  expression  of  indefinite  information  makes  proof  procedures  much  less  efficient. 

Logic  languages  also  give  a  procedural  interpretation  to  a  Horn  clause:  if  the  head 
is  a  goal,  invoke  the  atomic  formulas  of  the  body  as  subgoals.  Logic  languages  differ 
from  functional  languages  in  that  the  predicates  can  (generally)  be  used  in  more  than 
one  direction.  The  following  are  all  acceptable  goals: 

:- parent(a,b).
:- parent(X,b).
:- parent(a,Y).
:- parent(X,Y).

Predicates  that  can  be  used  in  more  than  one  direction  are  called  multi-use  programs. 
In  contrast,  imperative  and  functional  languages  compute  in  only  one  direction,  from 
argument  to  result. 
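
The multi-use property can be sketched outside a logic language. The following Python fragment is only an illustration and is not part of any system described in this report: the `PARENT` facts and the `query` helper are hypothetical, and `None` stands in for an unbound logic variable.

```python
# Hypothetical facts: parent(a, b) and parent(a, c).
PARENT = {("a", "b"), ("a", "c")}

def query(relation, pattern):
    """Return every fact matching the pattern; None marks an unbound variable."""
    return sorted(
        fact for fact in relation
        if all(p is None or p == v for p, v in zip(pattern, fact))
    )

print(query(PARENT, ("a", "b")))    # :- parent(a,b).  (a yes/no check)
print(query(PARENT, (None, "b")))   # :- parent(X,b).  (find the parents of b)
print(query(PARENT, ("a", None)))   # :- parent(a,Y).  (find the children of a)
print(query(PARENT, (None, None)))  # :- parent(X,Y).  (enumerate the relation)
```

The same stored relation answers all four goals; a function from argument to result would need a separate definition for each direction. The sketch does not handle a variable that occurs twice in one goal.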

Each  language  class  is  superior  to  each  other  class  with  respect  to  at  least  one  of 
our  requirements.  Imperative  languages,  particularly  LISP,  form  the  basis  of  current 
expert  system  shells,  and  therefore  demonstrably  support  the  facilities  listed  earlier. 
Expert  system  shells  based  on  logic  or  functional  languages,  e.g.  APES  [Hamm82],  are 
not nearly as highly developed. Furthermore, many imperative languages have been
extended with interfaces to database management systems. The differences between the
common  imperative  languages  and  relational  data  manipulation  languages  make  these 
interfaces  awkward  and  inefficient,  however. 

Current imperative languages discourage exploitation of parallelism. They typically
require explicit specification of concurrent processes, and hence make it difficult to
achieve  parallelism  proportional  to  the  amount  of  data  being  processed.  In  contrast, 
parallelism  is  often  implicit  in  functional  and  logic  languages.  Furthermore,  imperative 
languages  often  permit  uncontrolled  side  effects  that  preclude  partitioning  a  problem 
into  independent  subcomputations.  We  view  exploitation  of  parallelism  as  the  essence 
of  the  research  to  be  done.  Therefore,  we  have  excluded  imperative  languages  from 
consideration  as  the  parallel  inference  language. 

This  is  not  to  say  that  large-scale  parallelism  cannot  be  achieved  using  imperative 
languages. Object-oriented principles can be used to limit side effects and therefore promote
parallelism. Parallel object-oriented languages are a current research area.

Functional  languages  have  implicit  parallelism  in  the  concurrent  evaluation  of 
arguments  to  a  function.  The  U-Interpreter  for  the  Id  language  also  exhibits  large-scale 
parallelism  [Arvi82].  However,  the  lack  of  side  effects  in  functional  languages  makes  it 
impossible to do such basic operations as update a database. For example, the FQL
language [Bune82] is a functional language designed for querying databases. It models
stored  data  as  functions,  and  has  a  syntax  similar  to  FP.  However,  it  has  no  update 
capability.  For  this  reason,  we  have  eliminated  functional  languages  as  unsuitable  to 
database-oriented  applications. 

This  leaves  logic  languages.  Logic  languages  provide  both  nonprocedural  database 
access  and  the  opportunity  for  large-scale  parallelism. 

A  logic  language  can  provide  the  functionality  of  a  relational  database  system  in 
the  following  way.  Relations  can  be  represented  as  sets  of  facts.  Queries  are  simply 
goals. Views (derived relations) can be defined using ordinary Horn clauses, as illustrated
below. Updates can be accomplished using built-in predicates that have the side
effect  of  adding  or  deleting  clauses.  It  is  easy  to  show  that  a  logic  language  with  the 
semantics  defined  above  is  relationally  complete. 

Three forms of parallelism have been identified for logic languages: and-parallelism,
stream parallelism, and or-parallelism. And-parallelism is the parallel resolution
of different literals in the body of a Horn clause. This resolution must be coordinated
when the literals share logic variables. Stream parallelism is a form of and-parallelism
involving parallel binding and use of a structured value, typically a list.
Or-parallelism is parallel pursuit of alternative proofs of a literal.

And-  and  or-parallelism  have  analogs  in  relational  database  terms.  Consider  a 
database  with  relations 


emp(Eno, Ename, Date_of_birth, Dno)
dept(Dno, Dname, Location)

The SQL query to find the names of all employees in the Sales department is

SELECT Ename FROM emp, dept
WHERE emp.Dno = dept.Dno AND dept.Dname = "SALES"

A relational algebra tree to compute this query is shown in Figure 2.1.

Figure 2.1. Relational Algebra Tree

The  equivalent  query  in  a  logic  language  would  be 

emp_dept(Eno, Ename, Dname) :- emp(Eno, Ename, _, Dno),
                               dept(Dno, Dname, _).

:- emp_dept(_, Ename, 'SALES').

(Here,  we  have  defined  the  emp_dept  view  before  issuing  the  query.)  And-parallelism  in 
the  resolution  of  the  first  clause’s  body  is  realized  by  concurrent  evaluation  of  the  two 
independent branches of the relational algebra tree, with the join operation implementing
the coordination of the shared logic variable Dno. Or-parallelism is realized by processing
multiple tuples at a time in the select, project, and join operations. Stream
parallelism appears to have no analog in relational database systems because variables
are bound to unstructured values such as numbers and strings, not lists or other structures
that could be used as they are generated.

2.3. Alternative Logic Languages

We  decided  to  concentrate  on  logic  languages  because  they  provide  nonprocedural 
database  access  and  the  potential  for  large-scale  parallelism.  We  considered  three  logic 
languages: Prolog, Concurrent Prolog [Shap83], and PARLOG.

The  semantics  of  Prolog  are  defined  in  terms  of  a  sequential  execution  model.  In 
this  model,  the  order  of  the  clauses  in  a  Prolog  "database"  is  significant;  the  database  is 
scanned  sequentially  from  top  to  bottom  when  attempting  to  satisfy  a  goal.  Within  a 
clause,  the  atomic  formulas  in  the  body  are  satisfied  left  to  right  in  order.  Finally,  the 
cut  operator,  when  executed,  prevents  searching  for  later  clauses  to  satisfy  the  goal  that 
the  current  clause  is  attempting  to  satisfy. 

This  top-down,  left-right  execution  model  does  not  exploit  and-,  or-,  or  stream 
parallelism.  Referring  to  the  emp_dept  query  above,  the  top-down  search  rule  prevents 
an  or-parallel  search  of  the  dept  tuples  to  find  the  "SALES"  department.  It  also 
prevents  an  and-parallel  execution  of  the  projection  of  emp  and  selection  of  dept, 
should  this  be  the  chosen  evaluation  strategy. 

The  Prolog  execution  model  dictates  a  particularly  inefficient  execution  of  queries 
involving  joins,  such  as  the  emp_dept  query  shown  above.  The  join  between  the  emp 
and  dept  relations  is  executed  essentially  as  a  nested  loop  join  with  emp  as  the  outer 
relation  and  dept  as  the  inner  relation.  The  execution  time  of  this  join  is  proportional 
to  the  product  of  the  relation  sizes.  Joins  can  often  be  performed  much  more  efficiently 
using  a  sort-merge  or  hash-based  algorithm  [Vald84].  The  inefficiency  of  the  nested  loop 
join  relative  to  other  algorithms  increases  with  the  size  of  the  relations.  Even  if  a 
nested  loop  join  is  to  be  used,  using  the  smaller  relation  (presumably  dept  in  this  case) 
as  the  outer  relation  results  in  faster  execution  when  the  relation  tuples  are  stored  on 
disk. 
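
The cost difference between the two join strategies can be made concrete with a small sketch. The Python fragment below is illustrative only: the sample tuples are invented, and the two functions are simplified in-memory stand-ins for a nested loop join and a hash-based join, not the algorithms of [Vald84].

```python
from collections import defaultdict

# Hypothetical sample tuples:
#   emp(Eno, Ename, Date_of_birth, Dno) and dept(Dno, Dname, Location).
EMP = [(1, "jones", "1950-01-01", 10),
       (2, "smith", "1955-06-30", 20),
       (3, "lee", "1960-03-15", 10)]
DEPT = [(10, "SALES", "boston"), (20, "RESEARCH", "austin")]

def nested_loop_join(emp, dept):
    """Compare every emp tuple with every dept tuple: len(emp) * len(dept) steps."""
    return [(e[1], d[1]) for e in emp for d in dept if e[3] == d[0]]

def hash_join(emp, dept):
    """Build a hash table on dept.Dno, then probe it once per emp tuple."""
    by_dno = defaultdict(list)
    for d in dept:
        by_dno[d[0]].append(d)
    return [(e[1], d[1]) for e in emp for d in by_dno.get(e[3], [])]

# Both strategies produce the same (Ename, Dname) pairs.
print(sorted(nested_loop_join(EMP, DEPT)) == sorted(hash_join(EMP, DEPT)))
```

The nested loop version performs a number of comparisons proportional to the product of the relation sizes, while the hash version touches each tuple a small, roughly constant number of times.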

It is possible to analyze Prolog programs to detect instances where a clever execution
strategy, such as an efficient join or parallel execution of clauses, produces the same
result  as  the  standard  execution  model.  An  optimizing  Prolog  compiler  could  perform 
this  analysis  and  choose  the  clever  strategy  where  possible.  However,  we  believe  that 
optimizing  a  sequential  language  for  parallel  execution  is  inferior  to  starting  with  a 
language  designed  for  parallel  execution  in  the  first  place,  assuming  that  such  a 
language  is  available. 

A  further  problem  with  Prolog  is  that  it  does  not  always  provide  the  natural  fixed 
point  semantics  for  execution  of  a  set  of  Horn  clauses.  This  occurs  when  the  Horn 
clauses  are  recursive.  For  instance,  consider  the  following  definition  of  the  ancestor 
relation: 

ancestor(X,Y) :- parent(X,Y).
ancestor(X,Y) :- ancestor(X,Z), parent(Z,Y).

The definition of ancestor is left-recursive; application of Prolog execution rules leads to
an infinite loop. This is a problem with Prolog semantics and not with any implementation
of those semantics: a clever execution strategy does not alleviate the problem.

Having eliminated Prolog, we examined two variants of Prolog designed for parallel
execution: Concurrent Prolog and PARLOG. Concurrent Prolog presents a third
interpretation of logic programs in addition to the declarative and procedural interpretations:
atomic formulas can be executed as processes. A Horn clause represents an
expansion of a process (the predicate in the consequent) into a set of processes (the
predicates in the body). Processes communicate with each other via shared logic variables.
This is clear and-parallelism. Stream parallelism is realized when one process
binds  a  logic  variable  to  a  structured  value  (typically  a  list),  producing  the  structure  as 
another  process  consumes  it.  A  synchronization  mechanism  called  read-only  variables  is 
used  to  delay  the  consumer  process  when  it  attempts  to  reference  a  variable  that  the 
producer process has not yet bound. These concepts appeared in an earlier logic
language, the "Relational Language" [Clar81].

In  this  interpretation  of  logic  programs,  there  is  no  generation  of  alternative  proofs 
for  a  goal  via  or-parallelism  or  backtracking.  Concurrent  Prolog  checks  multiple  clauses 
in parallel for applicability, and then nondeterministically chooses one of them. A common
example is nondeterministic merge [Shap83]. The merge relation interleaves its
first  two  arguments  nondeterministically  to  produce  the  third  argument: 

merge([X|Xs], Ys, [X|Zs]) :- merge(Xs?, Ys?, Zs).
merge(Xs, [Y|Ys], [Y|Zs]) :- merge(Xs?, Ys?, Zs).
merge(Xs, [], Xs).
merge([], Ys, Ys).

(The question marks indicate read-only variables.) Concurrent Prolog has no backtracking:
once the choice is made between the first two clauses, the choice is never
revisited. The merge relation produces some merging of its input arguments, not all of
them. Thus, the semantics of Concurrent Prolog, while appropriate for the construction
of concurrent systems, are quite different from fixpoint semantics and therefore inappropriate
for database retrieval.
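
The difference between "some merging" and "all mergings" can be illustrated with a short sketch. The Python fragment below is not Concurrent Prolog and makes no claim about its implementation; `merge_committed` and `merge_all` are hypothetical helpers, and a pseudo-random choice stands in for the nondeterministic selection between the first two merge clauses.

```python
import random

def merge_committed(xs, ys, rng):
    """Produce one interleaving; each choice is final (no backtracking)."""
    xs, ys, out = list(xs), list(ys), []
    while xs and ys:
        source = xs if rng.random() < 0.5 else ys  # commit to one clause
        out.append(source.pop(0))
    return out + xs + ys  # one input is exhausted: copy the remainder

def merge_all(xs, ys):
    """Enumerate every interleaving, as a fixpoint reading of merge would."""
    if not xs or not ys:
        return [list(xs) + list(ys)]
    return ([[xs[0]] + rest for rest in merge_all(xs[1:], ys)] +
            [[ys[0]] + rest for rest in merge_all(xs, ys[1:])])

one = merge_committed([1, 2], [3, 4], random.Random(0))
print(one in merge_all([1, 2], [3, 4]))  # the committed result is one of the solutions
```

A database query corresponds to `merge_all`: every answer is wanted. Committed choice delivers only one member of that set, which is appropriate for a running process but not for retrieval.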

PARLOG, in contrast, defines two different kinds of relations (predicates): single-solution
relations and all-solutions relations. Single-solution relations are executed using
parallel process semantics similar to those of Concurrent Prolog. However, their
semantics  have  been  defined  to  eliminate  runtime  management  of  multiple  binding 
environments.  Each  single-solution  relation  must  have  a  mode  declaration  that  indicates 
which  of  the  relation’s  arguments  are  inputs  and  which  are  outputs.  For  example,  the 
following  relation  computes  the  sum  of  the  elements  on  a  list: 

mode sum(List?, Sum^).

sum([], 0).
sum([X|Xs], S1) :- sum(Xs, S), S1 is S + X.

Single-solution relations are thus basically functions that compute results from arguments.
They provide a procedural computation facility similar to LISP. It is possible,
though awkward, to specify single-solution relations that compute in more than one
direction. (PARLOG does provide nondeterminism and logic variables, both of which
are absent in LISP.)

All-solutions relations are defined using pure Horn clauses that can be executed
with fixpoint semantics. All-solutions relations have no mode declarations, and can provide
nonprocedural database access as illustrated earlier.

The interface between the two kinds of relations is provided by the "set" constructor,
which constructs a list of the results of an all-solutions relation query. This list can
then be manipulated by single-solution relations. For example, the following clauses
compute the sum of the ages of Sam’s children.

sum(Ages, Number).

set(Ages, A, (parent(Sam, X), age(X, A))).

Here, Ages is the list generated by the set constructor. It is a list of terms A, where A
must satisfy parent(Sam, X) and age(X,A) for some child X. The combination of capabilities
provided by single- and all-solutions relations makes PARLOG attractive for
complete applications. Section 2.5 gives a more complete description of PARLOG.
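
The division of labor between the two kinds of relations can be sketched as follows. This Python fragment is an analogy only, not PARLOG: the facts, names, and ages are invented, `ages_of_children` plays the role of the set constructor over an all-solutions query, and `total` plays the role of the single-solution sum relation.

```python
# Hypothetical facts; the names and ages are invented for illustration.
PARENT = {("sam", "ann"), ("sam", "bob"), ("kim", "cal")}
AGE = {"ann": 12, "bob": 9, "cal": 30}

def ages_of_children(person):
    """All-solutions step: collect every A with parent(person, X) and age(X, A)."""
    return [AGE[child] for (p, child) in sorted(PARENT) if p == person]

def total(values):
    """Single-solution step: fold the collected list into one result."""
    result = 0
    for v in values:
        result += v
    return result

print(total(ages_of_children("sam")))
```

The collection step is declarative (it names a condition, not a search order), while the summation step is an ordinary directed computation from an input list to an output number.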

2.4.  PARLOG  and  the  Language  Requirements 

Let  us  review  PARLOG  with  respect  to  the  language  requirements  stated  earlier. 

2.4.1.  Large-Scale  Parallelism 

Large-scale  parallelism  in  single-solution  relations  is  hindered  by  the  lack  of  an 
array  or  similar  direct-access  data  structure  in  PARLOG.  (Other  logic  languages  have 
the  same  limitation.)  Data  that  might  be  organized  as  an  array  in  a  conventional 
language  must  be  organized  as  a  list  (or  perhaps  as  a  tree,  for  faster  access)  in  a  logic 
language.  In  order  to  perform  some  operation  on  all  elements  of  a  list,  the  list  must  be 
traversed  element  by  element,  spawning  a  process  to  perform  the  desired  operation  on 
each  element,  as  in  the  sum  example  above.  This  list  traversal  time,  which  is  linear  in 
the  number  of  elements,  may  dominate  the  total  processing  time.  Organizing  the  data 
as a tree can reduce the traversal time to be logarithmic in the amount of data, assuming
enough processors.
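
The logarithmic bound can be checked with a small simulation. The Python sketch below is an assumption-laden illustration, not a PARLOG program: `reduce_in_rounds` is a hypothetical helper that combines adjacent pairs level by level, as a balanced tree would, and counts the levels. It runs sequentially and assumes a non-empty input.

```python
def reduce_in_rounds(values, combine):
    """Combine adjacent pairs until one value remains; return (result, rounds).

    Every pair within a round is independent, so with enough processors a
    round costs one step and the number of rounds is about log2(len(values)).
    """
    level, rounds = list(values), 0
    while len(level) > 1:
        level = [combine(*level[i:i + 2]) if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds

# Eight elements are summed in three rounds rather than seven sequential steps.
print(reduce_in_rounds(range(1, 9), lambda a, b: a + b))
```

A list forces the additions into a chain of length n - 1; the tree shape exposes the independent pairs that a parallel machine could evaluate together.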

Large-scale  parallelism  can  be  achieved  in  all-solutions  relations  using  or-parallelism 
in a multiprocessor architecture. For instance, the Teradata database machine partitions
the tuples of each relation across an array of processor/disk pairs [Tera83]. The
common  relational  operations  can  then  be  executed  in  parallel  in  these  partitions.  An 
implementation  of  all-solutions  relations  could  make  use  of  similar  techniques. 
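
The idea of partitioned execution can be sketched in a few lines. The Python fragment below is a toy illustration under stated assumptions, not a description of the Teradata design: `partition` and `parallel_select` are hypothetical helpers, tuples are hash-partitioned on a key, and a thread pool stands in for the array of processor/disk pairs.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(tuples, n, key):
    """Hash-partition tuples into n fragments on the given key function."""
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[hash(key(t)) % n].append(t)
    return parts

def parallel_select(parts, predicate):
    """Apply the selection to every fragment concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        chunks = pool.map(lambda part: [t for t in part if predicate(t)], parts)
        return sorted(t for chunk in chunks for t in chunk)

# Hypothetical dept(Dno, Dname) tuples spread over two fragments.
parts = partition([(10, "SALES"), (20, "RESEARCH"), (30, "SALES")], 2,
                  key=lambda t: t[0])
print(parallel_select(parts, lambda t: t[1] == "SALES"))
```

Each fragment is scanned independently, so the degree of parallelism grows with the number of fragments and hence with the amount of data.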

2.4.2.  Base  for  Expert  System  Shell 

Forward  chaining:  Existing  logic-based  languages,  including  PARLOG,  have  no  built-in 
forward  chaining  facility.  However,  forward  chaining  can  be  implemented  on  top  of 
logic languages [Subr85].

Backward chaining: PARLOG, like Prolog, provides backward chaining for goal-directed
reasoning. When multiple alternatives must be explored to find a proof for a
goal (the usual case), all-solutions relations should be used.

Procedural  computation:  PARLOG’s  single-solution  relations  can  be  used  as  functions, 
since each argument of a relation must be declared as input or output. These declarations
permit more efficient implementation of procedural computation than would be
possible in Prolog. However, PARLOG lacks structured data types, such as arrays
and  records,  that  are  found  in  conventional  programming  languages,  though  they  could 
be  added.  In  this  respect,  PARLOG  is  probably  no  worse  than  early  dialects  of  LISP. 
Each  of  these  features  can  be  mimicked  in  PARLOG.  For  instance,  arrays  can  be 
implemented  (inefficiently)  using  relations.  However,  the  lack  of  direct  support  makes 
PARLOG  a  poor  vehicle  for  applications  that  make  heavy  use  of  them.  To  overcome 
this  problem,  an  interface  between  PARLOG  and  the  C  language  has  already  been 
developed  to  support  multiprocessing  for  numeric  applications  [Butl85]. 

Object-oriented  knowledge  representation:  Shapiro  and  Takeuchi  have  shown  that 
object-oriented  programming  can  be  performed  in  Concurrent  Prolog;  similar  techniques 
work for PARLOG, using single-solution relations. Each object is represented by a perpetual
process that receives a stream of messages as input and generates a stream of
messages  as  output.  The  state  of  the  object  is  maintained  in  logic  variables  local  to  the 
process.  While  this  implementation  may  be  functionally  adequate,  the  cost  of  dedicat¬ 
ing  a  process  to  each  object  may  be  excessive.  The  execution  model  for  PARLOG 
single-solution relations presented in the next chapter is designed to minimize the overhead
of process creation and termination. It may also be possible to develop compile-time
techniques to reduce this overhead.

Evidential reasoning, belief maintenance, nonmonotonic reasoning, and explanation facility:
PARLOG provides no direct support for these facilities. However, their implementation
in PARLOG appears relatively straightforward [Subr85].

2.4.3.  Support  for  Nonprocedural  Database  Access 

As  stated  earlier,  PARLOG  provides  a  relationally  complete  nonprocedural 
language using its all-solutions relations. It goes beyond relational completeness by providing
recursive queries. Aggregate functions can be implemented easily using a combination
of single- and all-solutions relations, as illustrated in the program for summing
the  ages  of  Sam’s  children. 

To  summarize,  PARLOG  is  not  uniformly  superior  to  all  other  languages  with 
respect  to  our  requirements.  In  many  cases,  such  as  procedural  computation  and 
object-oriented knowledge representation, other languages are clearly superior to PARLOG.
However, we are not aware of another language that meets the entire set of
requirements better than PARLOG. Parallel languages for symbolic computation are
an active research area, and we expect better languages to emerge in the next few years.

2.5. An Overview of PARLOG

In this section, we provide an overview of PARLOG. For a more detailed exposition,
the reader is referred to [Greg85] and [Clar86]. As mentioned in section 2.3,
PARLOG  features  two  kinds  of  relations:  single-solution  relations,  and  all-solutions  rela¬ 
tions.  All-solutions  relations  can  appear  in  the  body  of  single-solution  relations.  The 
evaluation  semantics  for  these  relations  are  completely  different.  Indeed,  they  can  be 
viewed  as  two  different  languages  —  all-solutions  relations  constituting  a  query  language 
and single-solution relations a parallel applications programming language. An all-solutions
relation query can be evaluated by computing the least fixed point of the Horn
clauses  defining  this  relation.  Thus,  PARLOG  can  be  considered  as  having  a  Horn 
clause  query  language  embedded  within  it. 
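
To make the least fixed point concrete, the following Python sketch evaluates the left-recursive ancestor definition of section 2.3 bottom-up. It is a naive illustration under stated assumptions (the `PARENT` facts and the `ancestors` helper are hypothetical), not the evaluation strategy of any PARLOG implementation.

```python
# Hypothetical parent facts: a is the parent of b, b of c, c of d.
PARENT = {("a", "b"), ("b", "c"), ("c", "d")}

def ancestors(parent):
    """Naive bottom-up evaluation of
         ancestor(X,Y) :- parent(X,Y).
         ancestor(X,Y) :- ancestor(X,Z), parent(Z,Y).
    Iterate until no new tuple is derived (the least fixed point)."""
    result = set(parent)
    while True:
        derived = {(x, y) for (x, z) in result for (z2, y) in parent if z == z2}
        if derived <= result:
            return result
        result |= derived

print(sorted(ancestors(PARENT)))
```

Because each round only adds tuples and the set of possible tuples is finite, the iteration terminates even for the left-recursive clause and for cyclic data, which is where the sequential Prolog execution model loops.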

Single-solution  relations  compute  just  a  single  solution  to  a  query.  They  provide  a 
procedural computation facility similar, but not identical, to that provided by functional
programming languages. The difference arises from the logical variable, which is
basically a variable in a logic programming language. However, terms bound to such
variables may be only partially instantiated, i.e., they may contain variables. If such
terms appear in input argument positions of relation calls, the call may bind the uninstantiated
variables. This is in contrast to functional programming languages, where
arguments of a function call are fully instantiated. Also, in functional programming
languages, evaluation of a function cannot produce partially instantiated data structures.
All-solutions relations compute all the solutions to a query. Therefore, they are
suitable  for  non-procedural  access  to  databases.  We  describe  the  two  halves  of  the 
language  below. 

2.5.1.  Single-solution  Relations 

A  single-solution  relation  consists  of  a  mode  declaration,  and  a  set  of  guarded 
clauses. A mode declaration identifies arguments of a relation as being inputs or outputs.
For example, the mode declaration R(?, ^) specifies that the first argument of
relation R is an input, while its second argument is an output.

A guarded clause is a clause of the form:

head <- guard : body

where head is a literal, and where guard and body are possibly empty conjunctions of
literals. If the guard is empty, the ":" operator is not present. A literal is a tuple
prefixed by a relation name. An example of a guarded clause is the following:

R(t1, ..., tk) <- G1(p1, ..., pl), ..., Gm(q1, ..., qn) :
                  B1(r1, ..., rs), ..., Bp(w1, ..., wu)
PARLOG features both sequential and parallel conjunctions. An example is the
following.

(R1 & (R2, R3)), R4, R5

The sequential conjunction operator "&" indicates that literals following it are to be
evaluated only after all previous literals have succeeded. On the other hand, the parallel
conjunction operator "," indicates that a ","-separated group of literals may be
evaluated in parallel. Thus, in the example above, R2 and R3 may be evaluated in
parallel, but only after R1 has been successfully evaluated. Thus, the "&" and ","
operators have a control significance: they dictate the order in which literals in a conjunction
are to be evaluated.

Mode declarations impose a directionality on logic programs. If a particular argument
position has a mode annotation "?", then a non-variable term, say [a|b], appearing
in that argument position in the head of a clause can be used only for input matching.
That is, the call argument for that position should be a substitution instance of
[a|b].

Similarly, if a particular argument position has a mode annotation "^", then the
call argument for that position must be an uninstantiated variable; otherwise a run-time
error occurs. A non-variable term appearing in that argument position in the head of a clause
can be used only for output matching.

The principal advantage of mode declarations is that most of the unification in
PARLOG can be compiled. The details are explained in [Greg85].

If  a  call  argument  is  instantiated  enough  to  be  able  to  determine  that  it  is  not  a 
substitution  instance  of  a  non-variable  term  appearing  in  an  input  argument  position  in 
the  head  of  a  clause,  the  attempt  to  use  that  clause  is  aborted.  However,  if  the  call 
argument  is  not  yet  instantiated  enough  to  be  able  to  make  a  decision,  the  attempt  to 
use  the  clause  is  suspended. 

A  clause  is  called  a  candidate  clause  for  a  relation  call  if  the  head  of  the  clause 
input matches with the call on all the input arguments, and if it has a successfully terminating
guard. A clause is called a non-candidate clause if either or both of these conditions
are false. As described before, input matching may cause suspension. In this case,
the  clause  cannot  yet  be  classified  as  a  non-candidate  clause. 
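
The three possible outcomes of input matching (success, failure, suspension) can be sketched as a small one-way matcher. The Python code below is a simplified illustration, not PARLOG's compiled unification: the term encoding (tuples for structures, strings for constants), the `Var` class, and the `ANY` marker for a head variable are all assumptions made for this example, and repeated head variables are not handled.

```python
class Var:
    """An unbound logic variable in a call argument."""

ANY = object()  # stands for a variable in the clause head: matches anything

def input_match(pattern, arg):
    """One-way matching of a head pattern against a call argument.

    Returns "match", "fail", or "suspend" (argument not yet instantiated enough).
    """
    if pattern is ANY:
        return "match"
    if isinstance(arg, Var):
        return "suspend"
    if isinstance(pattern, tuple):
        if not isinstance(arg, tuple) or len(arg) != len(pattern):
            return "fail"
        results = [input_match(p, a) for p, a in zip(pattern, arg)]
        if "fail" in results:
            return "fail"  # a definite mismatch outweighs any suspension
        return "suspend" if "suspend" in results else "match"
    return "match" if pattern == arg else "fail"

head = ("cons", "a", ANY)                        # the head term [a|Rest]
print(input_match(head, ("cons", "a", ("nil",))))  # call argument [a]
print(input_match(head, ("cons", "b", Var())))     # call argument [b|T]
print(input_match(head, ("cons", Var(), "x")))     # call argument [H|x]
```

Note that the matcher never binds a variable of the call: when it meets an unbound call variable opposite a non-variable head term, it reports suspension instead, which is what keeps input matching one-directional.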

The set of clauses defining a relation may be separated either by a "." operator or
a ";" operator. For example, a relation R may be defined by a set of five clauses, composed
like so:

(C1; (C2. C3)). C4. C5

The "." and ";" operators control the order in which the set of clauses are to be tried in
order to find a candidate clause. A ";" indicates that clauses following it are to be tried
only if all of the preceding clauses prove to be non-candidate clauses. However, all
clauses in a "."-separated group of clauses may be tried in parallel. Thus, in the example
above, the clauses C2 and C3 may be tried in parallel, but only after C1 proves to
be a non-candidate clause. Thus, just like the "&" and "," operators, the "." and ";"
operators have a control significance: they dictate how a solution to a query is to be
found.

The evaluation of a relation call starts out with an attempt to find a candidate
clause, in the order specified by the "." and ";" operators. The implementation is free
to choose any candidate clause. When it picks a candidate clause, it is said to commit
to that clause, i.e., the body of this clause is evaluated to find one solution to the call.
There is no backtracking on this choice. This is referred to as committed choice nondeterminism.
The ":" operator, separating the guard and body of a guarded clause, is
referred to as the commit operator.

Guards of clauses being tried during the search for a candidate clause are not
allowed to bind variables in the call until commitment. Guards that don’t bind variables
in the call are called safe guards. PARLOG enjoys the advantage that the safety
of  guards  can  be  checked  at  compile  time— a  property  that  is  significantly  conducive  to 
efficient  implementation.  The  compile  time  safety  check  obviates  the  need  to  maintain, 
at  run-time,  multiple  environments  for  the  different  guard  evaluations  and  export  them 
upon  commitment,  as  is  the  case  in  Concurrent  Prolog  [Shap83]. 

Binding of call variables is made, in the form of output matching, upon commitment.
Since there is no backtracking on the choice of the candidate clause, bindings
made  to  variables  never  need  be  retracted.  Thus,  variables  in  PARLOG  have  the 
single-assignment  property. 
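The commit-then-bind discipline described above can be illustrated with a minimal Python sketch. It is not PARLOG and not the implementation studied in this report: the names Var, call_relation, and sign_clauses are hypothetical, guards are modelled as pure tests on the call argument (so they are trivially safe), and clauses are tried in list order, which is only one of the orders an implementation is free to choose.

```python
class Var:
    """A single-assignment logic variable (illustrative only)."""
    _UNBOUND = object()

    def __init__(self):
        self.value = Var._UNBOUND

    def is_bound(self):
        return self.value is not Var._UNBOUND

    def bind(self, value):
        # Bindings are never retracted, so a second binding is an error.
        if self.is_bound():
            raise RuntimeError("variable already bound")
        self.value = value


def call_relation(clauses, arg, out):
    """Commit to the first clause whose guard accepts `arg`.

    Each clause is a (guard, body) pair. Guards only test `arg`;
    the body's result is bound to `out` after commitment.
    """
    for guard, body in clauses:
        if guard(arg):            # candidate clause found
            out.bind(body(arg))   # output matching happens upon commitment
            return True           # committed: no other clause is tried
    return False                  # no candidate clause: the call fails


# Hypothetical relation sign(N, S) with three guarded clauses.
sign_clauses = [
    (lambda n: n > 0, lambda n: "positive"),
    (lambda n: n < 0, lambda n: "negative"),
    (lambda n: n == 0, lambda n: "zero"),
]
```

Because the output variable is bound only after a guard has succeeded, no binding ever has to be undone, which is the property the text relies on.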

We now describe the distinguishing attribute of single-solution relations, viz.,
stream-AND parallelism. This form of parallelism arises when there are multiple
relation calls working concurrently on evaluating the same solution. They communicate by
passing bindings through shared variables. It is to be contrasted with all-solutions
AND parallelism, which arises when multiple solutions to a query are available.
Stream-AND parallelism requires that no more than one solution to a call be computed.
If such is the case, the solution can be generated incrementally, via a series of
approximations. If an approximate solution is never retracted, it can be communicated
immediately to other calls, thus making stream-AND parallelism easy to implement.
Single-solution relations in PARLOG feature this form of parallelism since shared
variables in such relations have the single-assignment property, and since only one
solution to such relations is computed (due to committed choice non-determinism).
Stream-AND parallelism is typically used in the construction of a list: different portions
of it would be instantiated by different relation calls working concurrently.
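As a rough illustration of stream-AND parallelism, the following Python sketch runs a producer and a consumer concurrently over a list that is instantiated one cell at a time. It is an analogy built on threads, not the mechanism of any PARLOG implementation; the names Var, produce, consume, and run are assumptions made for the example.

```python
import threading


class Var:
    """Single-assignment variable; readers block until it is bound."""

    def __init__(self):
        self._bound = threading.Event()
        self._value = None

    def bind(self, value):
        if self._bound.is_set():
            raise RuntimeError("variable already bound")
        self._value = value
        self._bound.set()

    def read(self):
        self._bound.wait()   # suspend until some producer binds the variable
        return self._value


NIL = "NIL"


def produce(n, out):
    """Bind `out` to the list 1..n, one cell at a time."""
    for i in range(1, n + 1):
        tail = Var()
        out.bind((i, tail))  # the partial list is visible to the consumer at once
        out = tail
    out.bind(NIL)


def consume(stream, total):
    """Sum the list while it is still being produced."""
    acc = 0
    cell = stream.read()
    while cell != NIL:
        head, tail = cell
        acc += head
        cell = tail.read()
    total.bind(acc)


def run(n):
    xs, total = Var(), Var()
    workers = [threading.Thread(target=produce, args=(n, xs)),
               threading.Thread(target=consume, args=(xs, total))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return total.read()
```

Each list cell is an approximation of the final list that is never retracted, so the consumer can safely act on it before the producer has finished.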

Object oriented programming is possible using single-solution PARLOG relations.
This is because, unlike functions, such relations can bind variables in their input
argument terms. Whenever a single-solution relation does this, a back communication to the
calling relation occurs. This mechanism is the basis for object-oriented programming in
PARLOG: objects are implemented as relations, a message is a partially bound term
given as input to a relation, and a reply is sent by completing the term binding.
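The back-communication idea can be sketched in Python as follows. The counter object, its "inc" and "get" messages, and the Var class are hypothetical and chosen only for illustration; a PARLOG object would be a recursive relation consuming a message stream rather than a loop.

```python
class Var:
    """Single-assignment variable used as a reply slot inside a message."""

    def __init__(self):
        self.bound, self.value = False, None

    def bind(self, value):
        if self.bound:
            raise RuntimeError("variable already bound")
        self.bound, self.value = True, value


def counter(messages, count=0):
    """A counter 'object': consumes a stream of messages in order."""
    for msg in messages:
        if msg[0] == "inc":
            count += 1
        elif msg[0] == "get":
            # Back communication: the reply completes the partially bound
            # term ("get", Reply) that the caller sent as input.
            msg[1].bind(count)
    return count
```

The caller keeps a reference to the unbound reply variable it placed in the message, and reads the answer from it once the object has bound it.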

Like  other  logic  programming  languages,  PARLOG  features  the  metacall,  which 
allows  data  to  be  executed  as  programs. 

Any PARLOG program can be automatically translated into a program in a lower
level language, called Kernel-PARLOG. It is the first step in the compilation of
PARLOG programs. See [Greg85] for the details. We have used Kernel-PARLOG as the
basis of our work in parallel inference architectures.

Kernel-PARLOG does not have any mode declarations. Instead, the mode
declarations in PARLOG programs are used to compile input and output matches to explicit
one-way unification (<=) calls. In the input match call, nt <= v, the left argument is
a non-variable term, and the right argument is a variable. During evaluation of the
call, variables in nt are bound, so as to make nt and v syntactically identical. The call
suspends if it can proceed only when variables in v get instantiated. In the output
match call, v <= nt, v must be an uninstantiated variable at the time of the call, in
which case it is bound to nt. Otherwise, there is a run-time error.
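The two uses of one-way unification can be sketched in Python under some simplifying assumptions: structures are tuples whose first element is the functor name, constants are plain Python values, and the names Var, deref, input_match, and output_match are invented for this example. The sketch binds variables of nt eagerly, which is acceptable here only because clause variables are assumed to be fresh for each call.

```python
class Var:
    def __init__(self, name):
        self.name = name
        self.ref = None          # None while the variable is unbound


def deref(term):
    """Follow variable bindings until a non-variable or unbound variable."""
    while isinstance(term, Var) and term.ref is not None:
        term = term.ref
    return term


SUCCESS, FAIL, SUSPEND = "success", "fail", "suspend"


def input_match(nt, v):
    """nt <= v: bind variables in nt only; never bind variables in v."""
    nt, v = deref(nt), deref(v)
    if isinstance(nt, Var):
        nt.ref = v               # a clause variable simply picks up the argument
        return SUCCESS
    if isinstance(v, Var):
        return SUSPEND           # cannot proceed until v is instantiated
    if isinstance(nt, tuple):
        if not isinstance(v, tuple) or len(nt) != len(v) or nt[0] != v[0]:
            return FAIL
        for a, b in zip(nt[1:], v[1:]):
            result = input_match(a, b)
            if result != SUCCESS:
                return result
        return SUCCESS
    return SUCCESS if nt == v else FAIL


def output_match(v, nt):
    """v <= nt: v must be uninstantiated, otherwise it is a run-time error."""
    v = deref(v)
    if not isinstance(v, Var):
        raise RuntimeError("output match on an instantiated variable")
    v.ref = nt
```

In this sketch a suspended call is reported through a return value; the caller would be expected to retry the match once the variable it is waiting on has been bound.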

2.5.2.  All-Solutions  Relations 

As the name implies, all-solutions relations compute all solutions to a query. In fact
they compute a list of all the solutions. This list may be consumed by a single-solution
relation call. As discussed in section 2.3, the set constructor is the interface between
single-solution relations and all-solutions relations. The operational semantics of the
set constructor are not specified. Thus, the PARLOG programmer may make no
assumption about the order in which the solutions are computed. This allows
considerable flexibility in the implementation of the set constructor. One possibility is a
Prolog-style backtracking evaluation. Another possibility is to have sets of solutions
computed independently, and then to perform a join operation on them.

2.6.  Conclusion 

In this chapter, we have presented our investigation of the alternatives for the
application interface language for a D/KBMS. This language forms the basis for
developing C3I applications such as planning, monitoring, threat assessment,
interpretation, etc. We identified three major requirements for this language. It must be
amenable to large scale parallelism, be a suitable base for implementing an expert system
shell, and support efficient nonprocedural database access. Among the various
imperative, logic, and functional languages that we evaluated, PARLOG best met these
requirements. The next chapter describes our investigation of parallel architectures for
executing PARLOG.


CHAPTER  3

Parallel  Architectures  for  the  D/KBMS  Application  Interface 


In  this  chapter,  we  describe  our  investigation  of  parallel  architectures  for  executing 
PARLOG,  our  choice  for  the  D/KBMS  application  interface  language.  Our  approach  to 
this  problem  was  the  following: 

•  Design  a  parallel  abstract  machine  for  PARLOG.  The  abstract  machine  is  not  a 
physical  hardware  architecture.  Its  purpose  is  to  lay  bare  all  the  functions  that 
need  to  be  performed  to  execute  PARLOG  programs  as  well  as  the  data  objects 
created  and  manipulated  by  these  functions. 

•  Develop a simulator for the parallel abstract machine. The simulator serves as an
implementation for the language. The simulator basically implements the
functions and the interactions between them. It serves as a tool for collecting data on
the execution behavior of the language, and analyzing this data.

•  Analyze the run-time execution behavior of PARLOG programs using the
simulator. The analysis focuses on questions such as: How much parallelism is there?
What is the granularity of the parallelism? What are the communication patterns?
What operations are performed most frequently? Do they require hardware
support? If so, what kind? etc.

•  Design  a  suitable  parallel  hardware  architecture  (both  interconnection  network  and 
processing  element)  for  implementing  the  functions,  and  map  the  abstract  machine 
(i.e.,  the  functions)  onto  it. 

This approach was motivated by the fact that while the area of parallel
architectures for concurrent logic programming languages enjoys vigorous activity (see
[Ito85, Mill84, Greg85, Hali84]), it is still very much a research area and much work
still needs to be done. In particular, the run-time execution behavior of concurrent
logic programs needs to be studied. We believe that the chances of obtaining high
performance are greatly enhanced if architectural decisions are based on an analysis of the
run-time execution behavior of programs. Such an analysis will ensure that the
architecture is well matched to the language semantics, which, in turn, is very important for
high performance.

Carrying  out  all  the  above  steps  is  a  major  effort  in  itself.  However,  as  part  of  the 
VLPDF  effort,  we  completed  the  first  two  steps  and  briefly  investigated  the  feasibility  of 
using  the  Connection  Machine  for  executing  PARLOG  by  studying  the  mapping  of  the 
abstract  machine  onto  the  Connection  Machine. 

This  chapter  presents  the  results  of  the  above  work  and  is  organized  as  follows. 
Section  3.1  describes  a  parallel  computational  model  for  PARLOG  to  motivate  the 
design of the abstract machine. Section 3.2 describes the abstract machine. Section 3.3
describes our investigation of the feasibility of mapping this abstract machine onto the
Connection  Machine  architecture.  Section  3.4  summarizes  our  conclusions  from  this 
part  of  the  VLPDF  investigations. 

3.1.  A  Parallel  Computational  Model  for  PARLOG 

In  this  section,  we  describe  a  parallel  computational  model  for  PARLOG  in  order 
to  motivate  the  design  of  the  abstract  machine.  Specifically,  we 

•  describe  the  data  objects  created  and  manipulated  during  execution  of  PARLOG 
programs, 

•  describe  how  PARLOG  programs  are  represented,  and 

•  describe what operations need to be performed in order to execute such programs.

The description of data objects includes:

•  a  description  of  the  different  types  of  data,  both  scalar  and  structured,  featured  in 
PARLOG; 

•  a  description  of  how  data  objects  are  represented; 

•  a description of how they will be addressed: globally addressed in a shared memory
system, or locally addressed in a loosely-coupled system; and

•  a  description  of  how  data  objects  are  aggregated  to  form  ever  more  complex 
objects. 

The  description  of  operations  includes: 

•  a  description  of  the  PARLOG  control  structures, 

•  a  description  of  how  these  operations  are  scheduled  for  execution, 

•  a  description  of  how  these  operations  are  synchronized,  and 

•  a  description  of  how  data  objects  are  created. 

3.1.1.  PARLOG  Control  Structures 

There are basically four control structures in PARLOG: sequential conjunction,
parallel conjunction, sequential search, and parallel search. The abstract AND/OR
process model for PARLOG proposed by Gregory in chapter 6 of his dissertation [Greg85]
is an ideal vehicle for understanding the control structures of PARLOG. It overcomes
the weakness of the AND/OR tree representation (a graphical representation that
captures the different evaluation paths arising during evaluation of a query in a Horn
clause program), viz., lack of control information, i.e., information about how a solution
to a query is found.

In this model, a process is created for evaluating user-defined literals,
non-compilable primitives, and conjunctions of literals, and for searching for a candidate
clause during evaluation of a literal. The state of a PARLOG evaluation is represented
by a process structure called the AND/OR process tree. The nodes in this tree are
processes. The leaf processes are either runnable or suspended on some variable. The
non-leaf processes are not runnable. They await results from their child processes. There
are two types of non-leaf processes: AND processes and OR processes. A process
assumes a type AND if it is to evaluate a conjunction of literals. A conjunction can be
a sequential conjunction or a parallel conjunction. A conjunction may consist of just a
single literal. A process assumes a type OR if it is to search for a candidate clause
among the clauses defining a relation.

The evaluation of a literal R starts out by searching for a candidate clause. If a
sequential group of clauses, i.e., ";" separated clauses, is to be searched, the process
evaluating the literal initially spawns a child process to evaluate the guard of the first
clause. It then becomes an OR process. Next, it sets its continuation, which for an OR
process is the code to be executed if the guard that is currently being evaluated fails.
The guards of the other clauses are also evaluated in the same child process. However,
for the other clauses, a guard is evaluated only if the guard of the preceding clause fails.
For example, if a relation R is defined as follows:

    R <- G1 : B1;

    R <- G2 : B2;

    R <- B3.

the process evaluating the literal R spawns a process to evaluate G1. It then becomes
an OR process. Finally, it sets its continuation to the code that will spawn a process
for evaluating G2.

If a parallel group of clauses, i.e., "." separated clauses, is to be searched, the
process evaluating the literal spawns child processes for evaluating the guards of these
clauses. These processes are executed in parallel. The spawning process becomes an OR
process. It then sets its continuation to the code to execute in case all the spawned
processes fail. For example, if a relation R is defined as follows:

    R <- G1 : B1.

    R <- G2 : B2;

    R <- B3.

the process evaluating R spawns two processes, one for evaluating the guard G1, and
the other for evaluating the guard G2. These two processes are evaluated in parallel.
The continuation of the spawning process is then set to the code that will spawn a
process for evaluating the body B3.

When a process evaluating a guard succeeds, its parent will commit to the clause
containing the guard, provided it has not already committed to another clause. The
action of commitment is manifested by the parent process proceeding to evaluate the body
of the clause committed to, and terminating the processes evaluating the other guards, if
any. Thus, in the last example above, if G2 succeeds before G1, the parent process will
proceed by evaluating B2 and terminating G1.
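The search for a candidate clause can be summarized by a small Python sketch. It is a sequential simulation under stated assumptions: find_candidate is a hypothetical name, guards are zero-argument callables, an empty guard is written as None, and the guards of a parallel group are evaluated one after another rather than concurrently, so the sketch commits to the first success in list order.

```python
def find_candidate(groups):
    """Return the body label of the clause committed to, or None on failure.

    `groups` is a list of clause groups. Groups are tried in order (";"),
    while the clauses inside one group (".") are conceptually tried in
    parallel. Each clause is a (guard, body_label) pair.
    """
    for group in groups:
        # Spawn one guard evaluation per clause in the group. A parallel
        # machine would run these concurrently; here they run in turn.
        results = [(guard is None or guard(), body) for guard, body in group]
        for succeeded, body in results:
            if succeeded:
                return body      # commit: the remaining guards are abandoned
        # Every guard in this group failed: resume at the continuation,
        # i.e. start searching the next group.
    return None
```

For the second relation above, the first group holds the clauses guarded by G1 and G2 and the second group holds the unguarded clause with body B3.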

If a sequential conjunction of literals, i.e., "&" separated literals, is to be evaluated,
the process evaluating the conjunction checks if the conjunction begins with a sequence
of primitive instructions (see chapter 5 of [Greg85]). If so, the primitive instructions are
executed within the process evaluating the conjunction. If, however, the sequential
conjunction begins with a user-defined relation call, or a non-compilable primitive, the
process initially spawns a child process to evaluate the first literal in the conjunction. It
then becomes an AND process. Next, it sets its continuation to the code to execute if
the literal currently being evaluated succeeds. For example, the process evaluating the
conjunction:

    A & B

first spawns a process for evaluating the literal A. It then becomes an AND process.
Finally, it sets its continuation to the code that will spawn a process for evaluating the
literal B.

If a parallel conjunction of literals, i.e., "," separated literals, is to be evaluated,
the process evaluating the conjunction spawns child processes to evaluate the literals in
the conjunction. These processes will be evaluated in parallel. The spawning process
becomes an AND process. It then sets its continuation to the code to execute if all the
child processes succeed. For example, the process evaluating the conjunction

    (A, B) & C

spawns two processes, one for evaluating the literal A, and the other for evaluating the
literal B. The continuation of the spawning process is set to the code that will spawn a
process to evaluate the literal C.

If a parallel conjunction occurs at the end of a clause body, the tail forking
optimization is applicable. This optimization is a generalization of the tail recursion
optimization applicable to sequential logic programs. Consider the following:

    R <- P1, P2.

    P1 <- B1 & B2 & ... & Bn & (P11, P12).

After successful evaluation of the sequential conjunction B1 & B2 & ... & Bn, the
AND process that evaluated this conjunction can proceed by spawning two child
processes: one for evaluating the literal P11, and the other for evaluating the literal P12.
However, there is no need to increase the depth of the process tree. Since the parent of
this process, i.e., the process evaluating the conjunction (P1, P2), is already an AND
process, the child processes for evaluating P11 and P12 can be attached as children of
that process. There would then be an AND process evaluating the conjunction

    (P11, P12, P2).

The tail forking optimization is very important since it prevents a steady increase
in the depth of the process tree during a long evaluation. Such an increase would tend
to occur, for example, in the evaluation of recursive calls.

It  is  useful  to  observe  that  an  AND  process  is  created  even  when  a  conjunction 
consisting  of  just  a  single  literal  is  to  be  evaluated.  Likewise,  an  OR  process  is  created 
to  control  the  search  for  a  candidate  clause  even  when  there  is  just  one  clause  defining  a 
relation. 


We will now trace the evaluation of a PARLOG query, say :A, B. The evaluation
proceeds as though the query was :Q, with a user-defined relation Q, defined by
Q <- A, B. A process, say P1, is created to evaluate the conjunction consisting of the
single literal Q. Since P1 is to evaluate a conjunction, it becomes an AND process.
Since Q is assumed to be a user-defined relation, P1 forks, creating a child process, say
P2, to evaluate the literal Q. P2 becomes an OR process in order to search for a
candidate clause. Since there is only one clause defining Q and its guard is empty, P2
commits to that clause. That is, P2 becomes an AND process evaluating the conjunction of
literals A and B. P2 then forks, creating two processes to evaluate the literals A and
B. These latter processes become OR processes in order to search for a candidate
clause. After commitment, they become AND processes evaluating the conjunction of
body literals. The evaluation of the query continues in this manner until the AND
process at the root of the AND/OR process tree, viz., P1, either succeeds or fails.

Since the control structures change the state of the AND/OR process tree, they can
be regarded as the control instructions of the abstract AND/OR process model. In
order to support PARLOG's control structures, three additional control instructions are
needed: success, commit, and fail. These are described in the following paragraphs.

Success corresponds to the successful evaluation of a single literal or a conjunction
of literals. The parent of a succeeding process, i.e., one evaluating either a single literal
or a conjunction of literals, always has a process type of AND. If the succeeding process
has no siblings, it is disposed of and its parent resumes execution at its continuation. If
there is no continuation, the parent process reports success to its parent. If the
succeeding process has siblings, it is simply disposed of.

Commit corresponds to the ":" operator of PARLOG. We have already discussed
the commit operation when we discussed the evaluation of a conjunction of guard
literals.

Fail corresponds to the FAIL instruction of PARLOG as well as failure in the
evaluation of any literal. The effect of failure depends upon the type of the parent
process.

If  the  parent  of  a  failing  process  is  an  OR  process,  the  failing  process  is  a  process 
evaluating  a  clause  guard.  If  the  failing  process  has  no  siblings,  it  is  disposed  of,  and  its 
parent  is  reactivated  at  its  continuation.  If  there  is  no  continuation,  the  parent  process 
reports  failure  to  its  parent.  If  the  failing  process  has  siblings,  it  is  simply  disposed  of. 

If the parent of a failing process is an AND process, the failing process is a process
evaluating a literal in a conjunction. In this case the entire conjunction fails. The
failing process and its siblings, if any, are disposed of. The parent process reports failure to
its parent.
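The success and fail rules above can be captured in a short Python sketch of a process tree. It is only an illustration: the Process class, the succeed and fail functions, and the outcome attribute are hypothetical names, continuations are modelled as callables that may spawn further children, and the commit instruction is deliberately left out.

```python
class Process:
    def __init__(self, kind, parent=None, continuation=None):
        self.kind = kind                  # "AND" or "OR"
        self.parent = parent
        self.continuation = continuation  # callable taking the process, or None
        self.children = []
        self.outcome = None               # recorded on the root only
        if parent is not None:
            parent.children.append(self)


def _resume(parent, report):
    """Run the parent's continuation, or pass the result further up."""
    cont, parent.continuation = parent.continuation, None
    if cont is not None:
        cont(parent)
    else:
        report(parent)


def succeed(proc):
    parent = proc.parent
    if parent is None:
        proc.outcome = "success"
        return
    # The parent of a succeeding process is assumed to be an AND process.
    parent.children.remove(proc)          # dispose of the succeeding process
    if not parent.children:               # no siblings left
        _resume(parent, succeed)


def fail(proc):
    parent = proc.parent
    if parent is None:
        proc.outcome = "failure"
        return
    if parent.kind == "AND":
        parent.children.clear()           # the whole conjunction fails
        fail(parent)
    else:                                 # OR parent: a guard has failed
        parent.children.remove(proc)
        if not parent.children:
            _resume(parent, fail)
```

With these rules, the conjunction (A, B) & C corresponds to an AND process with two children and a continuation that spawns the process for C once both have succeeded.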

3.1.2.  Data  Objects  and  their  Representation 

Data objects in PARLOG are called terms. A term is a constant, a variable, or a
structured term. A structured term is an n-tuple, optionally prefixed by a functor. The
components of the n-tuple, in turn, are terms. An example of a structured term is

    F(t1, t2, ..., tn)

where F is the functor. The arity of a structured term is the number of components in
its tuple.

Notice that a structured term looks like a function call, the functor being the
function name, and the tuple, the call arguments. Indeed, we will refer to structured terms
as structures, and the components of the tuple as the arguments of the structure.

As  another  point  of  terminology,  we  will  refer  to  the  memory  where  terms  are 
created  during  run-time  as  term  memory;  and  the  memory  that  contains  the  program  as 
clause  memory. 

We will now describe how constants, structures, and variables are represented.
Constants are represented as:

    TCON  type  contents

type          Integer, real, atom, string.

contents      Integer or real. Atoms and strings are also represented as
              integers.


Structures are represented as:

    TSTR  functor-name  arity  arg-pointer

functor-name  Integer, identifying the name of the structure.

arity         Number of arguments in the structure.

arg-pointer   Pointer to the first argument of the structure. The arguments of
              a structure are contiguous.


Variables are represented as follows:

    TVAR  b-ub-tp  l-nl  diffvar

b-ub-tp       BOUND, if the variable appears in term memory, and it is bound
              to a structure or another variable.

              UNBOUND, if the variable appears in term memory, and it is
              uninstantiated.

              TEMPLATE, if the variable appears in clause memory.

l-nl          This field would be needed only in parallel architectures.

              LOCAL, if the variable appears in term memory, is bound to
              another data object, and that data object is local, i.e., in the
              same PE; if it is uninstantiated; or if it appears in clause
              memory.

              NON-LOCAL, if the variable appears in term memory, is bound
              to another data object, and that data object is in some other PE.

diffvar       If b-ub-tp = BOUND, this field contains the address of the
              data object that the variable is bound to.

              If b-ub-tp = UNBOUND, this field contains a pointer to a
              data structure called a demand list, attached to the variable.
              When a reference to an uninstantiated variable is made, a
              demand for it is enqueued in the demand list. The enqueued
              demand contains all information necessary to restart the
              computation that caused the reference. Should the variable get
              instantiated to a non-variable term, the computation corresponding to
              each enqueued demand is re-scheduled for execution. Demand
              lists, thus, are the mechanism we use to implement suspension.
              The basic idea behind demand lists is exactly the same as
              I-structures [Thom80]; however, the implementation details are
              different.

              b-ub-tp = TEMPLATE implies that the variable is in clause
              memory. In this case, the diffvar field contains the variable's ID
              number, a number that uniquely identifies it within a clause.
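The way the diffvar field changes role between an unbound and a bound variable, and the way a demand list implements suspension, can be sketched in Python. The Variable class, its method names, and the run_queue argument are assumptions made for this example; computations are represented by arbitrary Python objects, and the bound value is assumed to be a non-variable term.

```python
class Variable:
    """Term-memory variable whose diffvar field is reused as described."""

    def __init__(self):
        self.tag = "UNBOUND"
        self.diffvar = []        # while unbound: the demand list

    def demand(self, computation, run_queue):
        """Reference the variable on behalf of `computation`."""
        if self.tag == "BOUND":
            run_queue.append(computation)     # value available: keep running
        else:
            self.diffvar.append(computation)  # suspend on the variable

    def bind(self, target, run_queue):
        """Bind to a non-variable term and wake every suspended computation."""
        if self.tag == "BOUND":
            raise RuntimeError("variable already bound")
        demands, self.diffvar = self.diffvar, target
        self.tag = "BOUND"
        run_queue.extend(demands)             # re-schedule in arrival order
```

A scheduler would drain run_queue; computations that reference the variable before it is bound never appear there until bind is called.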


In PARLOG, complex data objects are aggregated by nesting structured terms. An
example of a complex data object is the term F(A(B(x, y), C(a, b)), P(q, R(s)), z). As
syntactic sugar, nested structures with functor name CONS are written in list notation.
Thus, [a, b] is equivalent to CONS(a, CONS(b, NIL)). Also, [a | b] is equivalent to
CONS(a, b), and the empty list, [], to the constant NIL. We will assume that terms
with empty functors, i.e., just tuples, have an implicit functor name, COMMA, and
write such terms as nested structures. For example, ((a, b), c, d) is equivalent to
COMMA(COMMA(COMMA(a, b), c), d).
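The two desugaring rules can be checked with a small Python sketch, in which structures are written as tuples whose first element is the functor name. The helper names cons_list and comma_tuple are hypothetical.

```python
def cons_list(items, tail="NIL"):
    """Desugar [x1, ..., xn | tail] into nested CONS structures."""
    term = tail
    for item in reversed(items):
        term = ("CONS", item, term)
    return term


def comma_tuple(items):
    """Desugar a tuple of two or more terms into left-nested COMMA structures."""
    term = items[0]
    for item in items[1:]:
        term = ("COMMA", term, item)
    return term
```

For instance, cons_list(["a", "b"]) corresponds to [a, b], cons_list(["a"], tail="b") to [a | b], and comma_tuple([comma_tuple(["a", "b"]), "c", "d"]) to ((a, b), c, d).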

In our computational model, complex (and simple) data objects are Directed
Acyclic Graphs (DAGs). Figure 3.1 shows several examples. The leaves of the DAG, i.e.,
the nodes with outdegree zero, correspond to variables and constants. The indegree of
leaf nodes may be greater than one. Such would be the case for variables that occur
more than once in a term (see figure 3.1b). The nodes with outdegree greater than zero
are non-leaf nodes. Their indegree is restricted to be one. They correspond to functor
names.


Terms will be assumed to be globally addressable, i.e., there is a single system-wide
virtual memory. Addresses will consist of two components: a PE number and a
within-PE address. The motivation for using a single global address space is that stream-AND
parallelism is very difficult to implement in the absence of shared memory. This is
because, if a variable v is shared among several processes, it is not known in advance
which process will bind it. When v does get bound by some process, its binding will
have to be communicated to all the processors where the other processes sharing v
reside. This communication can get very complicated in the absence of shared memory.

Figure 3.1. DAG Representation of Terms
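A two-component global address can be sketched in Python as follows. The field width OFFSET_BITS and the names Address, pack, unpack, and locality are assumptions made purely for illustration; the report does not specify how the two components are encoded.

```python
from typing import NamedTuple


class Address(NamedTuple):
    pe: int        # processing element that owns the object
    offset: int    # address within that PE's memory


OFFSET_BITS = 24   # assumed width of the within-PE component


def pack(addr):
    """Encode an address as one machine word."""
    return (addr.pe << OFFSET_BITS) | addr.offset


def unpack(word):
    """Recover the PE number and the within-PE address."""
    return Address(word >> OFFSET_BITS, word & ((1 << OFFSET_BITS) - 1))


def locality(addr, my_pe):
    """Classify a reference the way the l-nl field of a variable does."""
    return "LOCAL" if addr.pe == my_pe else "NON-LOCAL"
```

Comparing the PE component against the local PE number is all that is needed to decide whether a dereference stays local or requires communication.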

Thus far, we have described data objects that are part of the language definition.
However, during execution of the language, data objects called process descriptors,
which are not part of the language definition, are created and destroyed. Process
descriptors are data structures representing the abstract AND/OR processes. Indeed,
whenever we use the term "process", we are actually talking about a process descriptor
data structure. We will refer to the memory in which process descriptors are created as
descriptor memory.

We briefly explain the meaning of the different fields comprising a process
descriptor. Their meaning will become clearer in section 3.2, when we describe execution of
PARLOG programs on the abstract machine.

isrunning     This field serves as a lock on the process descriptor. The reason such
              a field is necessary is that, in the interest of parallelism, we allow
              growth and pruning of the process tree to proceed concurrently. The
              lock serves to synchronize these operations. It is set whenever a node
              is added to the process tree; it is reset whenever the process tree is
              pruned.

killflag      This field is set if the process descriptor is locked (i.e., the isrunning
              field is set), but it is to be pruned away.

state         This field indicates whether the process is an AND process or an OR
              process.

parent-state  Same for the parent of this process.

child-count   For an AND process, this field indicates the number of literals in the
              conjunction being evaluated by the process. This includes user
              defined relation calls as well as calls to system primitives.

              For an OR process, it indicates the number of clauses being searched
              under its control.

owu-count     For an AND process, this field indicates the number of needed
              variables appearing on the left of "<=" calls in the conjunction. A
              variable appearing on the left of a "<=" is said to be not needed in the
              clause containing the "<=", if it appears in no other literal in the
              clause. The diffvar field of such variables is set to NOT-NEEDED,
              instead of their ID number.

              For an OR process, this field is irrelevant.

PID           Process ID. This field is comprised of two parts: a PE number, and a
              process number. Thus, every process has a unique, system-wide PID.
              PIDs are not reusable.

pPID          This field contains the process ID of the parent process. The PID
              and pPID fields link up the AND/OR process tree.

continuation  For an AND (OR) process, this field indicates the code to execute
              when all its child processes have succeeded (failed).

body-pointer  This field has relevance for an AND process evaluating a guard, in
              which case it contains the address of the body. This information is
              needed in case the evaluation commits to the clause whose guard is
              being evaluated by the AND process.

              For an OR process, this field is irrelevant.

lit-num       This field indicates which child this process is of its parent.

child-list    This field is a list containing the PIDs of the children of this process.

head-size     For an OR process, this field contains the number of arguments in
              the literal (relation) being evaluated.
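The field list above maps naturally onto a record type. The following Python sketch is one possible rendering, not the layout used by the simulator: the field types, default values, and the PidAllocator and spawn_child helpers are assumptions, and the continuation and body pointer are represented by placeholder types.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable, List, Optional, Tuple

PID = Tuple[int, int]            # (PE number, process number)


class PidAllocator:
    """Hands out PIDs for one PE; process numbers are never reused."""

    def __init__(self, pe):
        self.pe = pe
        self._numbers = count()

    def new(self):
        return (self.pe, next(self._numbers))


@dataclass
class ProcessDescriptor:
    pid: PID
    ppid: Optional[PID]          # None for the root process
    state: str                   # "AND" or "OR"
    parent_state: Optional[str] = None
    isrunning: bool = False      # lock on the descriptor
    killflag: bool = False       # locked, but due to be pruned
    child_count: int = 0
    owu_count: int = 0
    continuation: Optional[Callable] = None
    body_pointer: Optional[int] = None
    lit_num: int = 0
    child_list: List[PID] = field(default_factory=list)
    head_size: int = 0


def spawn_child(alloc, parent, state, lit_num):
    """Create a child descriptor and link it into the process tree."""
    child = ProcessDescriptor(pid=alloc.new(), ppid=parent.pid, state=state,
                              parent_state=parent.state, lit_num=lit_num)
    parent.child_list.append(child.pid)
    return child
```

Linking parent and child through PIDs rather than direct references mirrors the fact that the two descriptors may reside on different PEs.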

Another object that is not part of the language definition is the pointer vector.
The pointer vector is a vector of values and addresses. It serves as the binding
environment for evaluation of literals and conjunctions.

For  an  OR  process,  the  size  of  this  vector  is  equal  to  the  number  of  arguments  in 
the  literal  (relation)  being  evaluated.  The  contents  of  the  pointer  vector  are  the  call 
arguments.  Literal  arguments  that  are  instantiated  to  constants  are  passed  by  value. 
Uninstantiated  literal  arguments  and  those  instantiated  to  structures,  are  passed  by 
reference.  Thus,  each  slot  in  the  pointer  vector  may  contain  either  a  constant  or  an 
address. 

For  an  AND  process,  the  size  of  the  pointer  vector  is  equal  to  the  number  of  argu¬ 
ments  in  the  head  plus  the  number  of  local  variables  in  the  clause.  After  local  variables 
are  allocated,  their  addresses  are  inserted  into  the  pointer  vector.  Should  a  local  vari¬ 
able  get  instantiated  to  a  constant,  its  pointer  pointer  vector  slot  will  be  replaced  by 
that  constant.  Should  it  get  instantiated  to  something  other  than  a  constant,  its  pointer 
vector  slot  is  replaced  by  the  address  of  the  object  it  got  instantiated  to.  The  single 
assignment  property  of  PARLOG  obviates  the  need  for  consistency  checking  of  variable 
bindings  present  in  the  pointer  vector  and  those  present  in  term  memory. 
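As a rough illustration of the AND-process case, the following Python sketch models a pointer vector whose slots hold either a constant or an address. It is not part of the (AMP)2 specification: the names Addr, TermMemory, PointerVector and bind_local are hypothetical, and term memory is reduced to a list of cells.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Addr:
    """Hypothetical term-memory address."""
    index: int


class TermMemory:
    """Toy term memory: one cell per variable, None meaning 'unbound'."""

    def __init__(self):
        self.cells = []

    def alloc_var(self):
        self.cells.append(None)
        return Addr(len(self.cells) - 1)


class PointerVector:
    """Binding environment of an AND process (illustrative only)."""

    def __init__(self, head_args, n_locals, mem):
        self.mem = mem
        # Head arguments come first, then the addresses of the new locals.
        self.slots = list(head_args) + [mem.alloc_var() for _ in range(n_locals)]
        # Remember where each local variable lives in term memory.
        self.var_addr = {
            i: self.slots[i] for i in range(len(head_args), len(self.slots))
        }

    def bind_local(self, slot, value):
        """Bind a local to a constant, or to the Addr of a non-constant object."""
        addr = self.var_addr[slot]
        if self.mem.cells[addr.index] is not None:
            raise ValueError("single assignment violated")
        self.mem.cells[addr.index] = value
        # The slot is overwritten, so later reads need no term-memory lookup.
        self.slots[slot] = value
```

Because a variable is bound at most once, the copy kept in the slot and the binding kept in term memory can never disagree, which is the property the paragraph above relies on.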

Other objects that are not part of the language definition, but which are created
and destroyed during execution of PARLOG programs, are the one way unification state
and the test unification state. We defer the description of these objects until section 3.2,
where we describe how one way unification and test unification are performed in the
abstract machine.

3.1.3.  Program  Representation 

PARLOG  programs  are  represented  as  data,  in  order  to  facilitate  evaluation  of  the 
metacall.  Literals  are  represented  as  structures,  with  functor  name  being  the  name  of 
the  literal,  and  arity  equal  to  the  number  of  arguments  of  the  literal.  A  sequential  con¬ 
junction  of  literals  is  represented  as  a  nested  structure,  with  an  implicit  functor  name 
AMPERSAND  and  arity  2.  A  parallel  conjunction  of  literals  is  treated  similarly,  except 
that the implicit functor name is COMMA. As an example, the parallel conjunction
A(x, y), B(y, z), C(x) is represented as COMMA(COMMA(A(x, y), B(y, z)), C(x)).

"," and "&" are assumed to be left associative. Also, "," is assumed to bind tighter
than "&".
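The nesting can be sketched in Python as a left fold. This is only an illustration: literals are modelled as tuples, the helper names fold_left and conjunction are hypothetical, and the sketch assumes that "," binds tighter than "&", so a clause body is given as groups of parallel literals that are then joined sequentially.

```python
def fold_left(functor, items):
    """Nest items left-associatively under a binary functor."""
    tree = items[0]
    for item in items[1:]:
        tree = (functor, tree, item)
    return tree


def conjunction(groups):
    """Build a literal tree.

    `groups` is a list of lists of literals: literals inside a group are
    joined by COMMA, and the groups themselves are joined by AMPERSAND.
    """
    return fold_left("AMPERSAND", [fold_left("COMMA", g) for g in groups])


A = ("A", "x", "y")
B = ("B", "y", "z")
C = ("C", "x")

# A(x, y), B(y, z), C(x)
parallel = conjunction([[A, B, C]])

# A(x, y), B(y, z) & C(x)
mixed = conjunction([[A, B], [C]])
```

Here `parallel` has the shape COMMA(COMMA(A, B), C) from the example above, while `mixed` places an AMPERSAND node at the root with the COMMA structure beneath it.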

We refer to a DAG whose root is COMMA, AMPERSAND, or a literal name as a
literal tree. The leaves of a literal tree are variables.

The arity of the COMMA functor is set to PARALLEL CONJUNCTION if the
literal tree beneath it does not contain any AMPERSAND functors. Otherwise, it is set
to SEQUENTIAL CONJUNCTION. As will be seen later, this information is used to
determine whether the tail forking optimization is applicable or not.

We now discuss the representation of the PARLOG unification primitives, "<="
and "=". "<=" is represented as a structure with functor name ONE-WAY-UNIF.
Its arity is set equal to the number of variables in the left argument that are needed in
the clause. As explained in the previous section, a variable appearing on the left of a
"<=" is said to be not needed in the clause containing the "<=" if it appears in no
other literal in the clause. As will be seen later, the count stored in the arity field of
the ONE-WAY-UNIF functor will be used to determine whether a conjunction of
literals containing "<=" has succeeded.

In t1 <= t2, since t1 and t2 can each be of type TSTR, TCON, or TVAR, there are 9
cases possible. However, we transform the "<=" call so that a variable appears as the
left argument of the call, or as the right argument, or as both arguments. As an example
of the kind of transformation implied, nt1 <= nt2, where nt1 and nt2 are non-variable
terms, is transformed to nt1 <= v, v <= nt2, v a variable.
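The rewriting of a "<=" call between two non-variable terms can be sketched as follows. This is a minimal Python illustration under our own assumptions: the Var class, the fresh-variable naming scheme and the function normalize_one_way are hypothetical, and terms are modelled as tuples or constants.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Var:
    name: str


_fresh = count(1)


def fresh_var():
    """Return a variable that does not occur in the source clause."""
    return Var(f"_v{next(_fresh)}")


def normalize_one_way(left, right):
    """Rewrite `left <= right` so every call has a variable on some side.

    Returns a list of (left, right) pairs, each standing for one "<=" call.
    """
    if isinstance(left, Var) or isinstance(right, Var):
        return [(left, right)]            # already in the required form
    v = fresh_var()
    return [(left, v), (v, right)]        # nt1 <= v, v <= nt2
```

Only the syntactic form of the call changes; no attempt is made here to compile the unification into lower-level instructions.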

The  same  transformation  is  done  in  the  SPM  as  well.  However,  in  the  SPM, 
further  transformations  are  made,  which  compile  the  one  way  unification  to  a  sequence 
of  primitive  instructions.  The  transformations  above  merely  change  the  syntactic  form 
of  calls.  On  the  other  hand,  the  transformations  in  the  SPM  introduce  significant 
efficiency  benefits  by  reducing  the  run-time  overhead  for  a  sequential  architecture.  How¬ 
ever,  sequential  execution  of  the  primitive  instructions  might  obscure  parallelism  possi¬ 
ble  if  v  is  distributed  among  different  PEs.  In  order  not  to  lose  this  parallelism,  we 
choose  not  to  compile  one  way  unifications  down  to  low  levels. 

We transform test unification calls ("=") so that both arguments are variables.
For example, [a, b] = v is transformed to w = v, w <= [a, b]; [a, b] = [c, d] is
transformed to w = v, w <= [a, b], v <= [c, d]. w = v is then represented as a
structure with functor name TUM and arity 2. If one of the arguments is ground, the
test unification call is equivalent to the one way unification call. For example,
[A, B] = v is equivalent to [A, B] <= v. If both arguments are ground, the test
unification can be performed at compile time, flagging an error if necessary.

Clauses  are  represented  as  structures,  with  functor  name  BACKARROW.  The 
arity  of  the  BACKARROW  functor  is  set  equal  to  the  number  of  local  variables  in  the 
clause.  The  first  argument  of  this  functor  is  the  clause  head,  a  literal.  The  second  argu¬ 
ment  of  the  BACKARROW  functor  is  the  clause  guard,  which  can  be  either  empty,  or 
a  conjunction  of  literals.  If  the  guard  is  empty,  the  second  argument  is  the  constant 
EMPTY,  otherwise  it  is  a  structure  whose  functor  name  is  either  COMMA  or  AMPER¬ 
SAND.  The  third  argument  of  the  BACKARROW  functor  is  the  clause  body,  which  is 
represented  the  same  way  as  the  clause  guard. 

Relations are sets of clauses composed with the operators "." and ";". A "."
(";") separated set of clauses is represented as a nested structure with implicit functor
name DOT (SEMICOLON) and arity 2 (see figure 3.2). "." is assumed to bind tighter
than ";".

It should be clear that any DAG with root DOT or SEMICOLON would have to
have at least one sub-DAG rooted with BACKARROW. We refer to the sub-tree
beginning with the root and including the BACKARROW functors as a clause tree.
Thus, a clause tree has either a single node, the BACKARROW functor; or several
nodes, in which case its root is the DOT or SEMICOLON functor, and its leaves are
BACKARROW functors.

Figure 3.2. Example of program representation


The neutral composition operators, ".." and "and", are replaced by "." and ",",
respectively.

The arguments of literals are guaranteed to be variables or constants. This is
because we replace literals like A(B(x, y), C(D(e, f), g)) by A(a, b), a <= B(x, y),
b <= C(D(e, f), g). The result of this transformation is that the unification related
functors, ONE-WAY-UNIF and TUM, are the only ones that can have structures as
arguments.
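The flattening of structured literal arguments can be sketched in a few lines of Python. This is an illustration only: literals and structures are modelled as tuples whose first element is the functor name, and flatten_literal together with the supplier of fresh variable names is hypothetical.

```python
def flatten_literal(literal, fresh):
    """Replace structured arguments of a literal by fresh variables.

    `literal` is (name, arg1, ..., argN); a structured argument is a tuple.
    `fresh` is an iterator yielding unused variable names.
    Returns the flattened literal and a list of (var, structure) pairs,
    each pair standing for the call `var <= structure`.
    """
    name, *args = literal
    flat_args, unifications = [], []
    for arg in args:
        if isinstance(arg, tuple):        # structured argument: pull it out
            v = next(fresh)
            flat_args.append(v)
            unifications.append((v, arg))
        else:                             # variable or constant: keep as is
            flat_args.append(arg)
    return (name, *flat_args), unifications


lit = ("A", ("B", "x", "y"), ("C", ("D", "e", "f"), "g"))
flat, unifs = flatten_literal(lit, iter(["a", "b"]))
```

Nested structures such as D(e, f) stay inside the right-hand side of the generated "<=" call, which matches the statement that only the unification related functors carry structures as arguments.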

The PARLOG system primitives (Less, Times, Plus, Lesseq, Call, etc.) are compiled
so as to ensure that their arguments are instantiated at the time the primitive is
evaluated. For example, Call(x) is transformed to DATA(x) & CALL(x);
Times(x, y, z) is transformed to (DATA(x), DATA(y)) & TIMES(x, y, z1) & z <= z1.


3.1.4.  Operations 

In this section, we motivate the operations needed to execute PARLOG programs.
We will give a detailed description in section 3.2, when we
describe the abstract machine.

The operations needed to execute PARLOG programs fall under two categories:
unification, and operations that implement the PARLOG control structures. They are:

•  process tree growth;

•  output matching;

•  input matching (i.e., one way unification);

•  test unification; and

•  process tree management.

Process  tree  growth  and  process  tree  management  implement  the  control  structures 
of  PARLOG.  Process  tree  growth  is  the  creation  of  process  descriptors.  Process  tree 
management  is  the  execution  of  the  control  instructions,  commit,  success  and  fail, 
described in section 3.1.1. That is, process tree management is the pruning of the AND/OR
process tree. The reason for separating process tree growth and process tree management
is to generate more parallelism.

The other three operations are related to unification. Output matching is the
evaluation of a "<=" call in which the left argument is a variable. It is the operation
by which terms in PARLOG programs are created. One way unification is the evaluation
of a "<=" call in which the left argument is a non-variable term. Test unification
is the evaluation of an "=" call.

3.2.  A  Parallel  Abstract  Machine  for  PARLOG 

In this section, we describe a parallel abstract machine for executing PARLOG
programs. We call this abstract machine (AMP)2: Asynchronous Message-passing based
Parallel Abstract Machine for PARLOG.

The  architecture  of  (AMP)2  is  shown  in  figure  3.3.  (AMP)2  consists  of  abstract 
processing  elements  (PEs)  linked  together  by  an  abstract  network,  called  the  inter-PE 
interconnection  network.  Each  abstract  PE  is  a  collection  of  the  following  computing 
agents: User Interface (UI), Network Send (NS), Network Receive (NR), Process Tree
Manager (PTM), Process Tree Grower (PTG), Input Matcher (IM), Test Unifier (TU),
Output Matcher (OM), Data Checker (DC), Var Checker (VC), and Term Memory
Allocator (TMA). Each agent performs a dedicated function. The PTM, PTG, TU,
IM  and  OM  agents  respectively  perform  the  five  operations  we  motivated  in  section 
3.1.4:  process  tree  management,  process  tree  growth,  test  unification,  input  matching, 
and  output  matching.  The  DC  and  VC  agents  respectively  evaluate  the  DATA  and 
VAR  system  primitives  of  PARLOG.  The  DATA  primitive  succeeds  if  its  argument  is 
instantiated  to  a  non-variable  term.  Otherwise,  the  call  suspends.  It  can  never  fail. 
The  VAR  primitive  checks  whether  its  argument  is  instantiated  to  a  variable  at  the 
time  of  the  call.  If  so,  it  succeeds,  else  it  fails.  The  NS  agent  receives  messages  from 
other  agents  in  the  PE,  and  sends  them  out  to  the  network.  The  NR  agent  receives 
messages  from  the  network  and  routes  them  to  the  appropriate  agents  in  the  PE.  The 
UI is present only in PE0 and is the agent through which communication with the user
(either humans or application programs) is done.

Figure 3.3. (AMP)2 architecture


An agent can communicate with other agents by sending messages over the
abstract intra-PE communication network. It can also access one or more of the
following resources: Descriptor Memory, Clause Memory, Term Memory, Template
Memory, One Way Unification Scratchpad, Test Unification Scratchpad, and Output
Match Scratchpad. Agents access resources via the abstract intra-PE resource access
network.

Each agent has a mailbox, in which it receives messages. It processes messages in
its mailbox independently of the other agents. Thus, parallelism is exploited at two
levels in (AMP)2: several PEs processing concurrently, and several agents within a PE
processing messages concurrently.

(AMP)2 is a loosely coupled multiprocessor. However, it implements a shared
memory abstraction. That is, PARLOG programs executing on (AMP)2 see a single
address space. Why is providing a shared memory abstraction on a loosely coupled
architecture important? There are two opposing factors that come into play in PARLOG
implementations. On the one hand, a loosely coupled architecture is attractive for
performance reasons. On the other hand, a shared memory architecture is attractive for
stream-AND parallelism, since it is very difficult to implement stream-AND parallelism
on loosely coupled architectures [Greg85]. The major feature of (AMP)2 is that it
reconciles these opposing factors by providing a very efficient shared memory abstraction
on a loosely coupled architecture: efficient in the sense that the memory contention
and memory latency problems associated with shared memory systems are absent.


(AMP)2  is  an  abstract  architecture  and  not  a  physical  hardware  architecture.  Its 
purpose  is  to  lay  bare  all  the  functions  that  need  to  be  performed  in  order  to  execute 
PARLOG  programs.  It  thus  acts  as  a  functional  specification  for  the  hardware  architec¬ 
ture.  (AMP)2  features  only  logical  entities.  For  example,  the  abstract  networks  are 
logical  communication  channels.  These  channels  would  have  to  be  implemented  via 
appropriate  interconnection  networks  (e.g.,  bus,  banyan,  shuffle,  hypercube,  etc.)  in  a 
physical  hardware  architecture.  Likewise,  the  agents  are  also  logical  entities.  They 
would  have  to  be  mapped  onto  physical  hardware  in  order  to  get  a  physical  hardware 
architecture.  There  are  several  possibilities  for  this  mapping.  In  a  one-one  mapping, 
there  is  a  dedicated  hardware  unit  for  each  agent.  In  a  many-one  mapping,  one 
hardware  unit  performs  functions  represented  by  several  agents.  Finally,  in  a  one-many 
mapping,  several  identical  hardware  units  are  dedicated  to  a  single  function. 

3.2.1.  Executing  PARLOG  programs  on  (AMP)2 

The user's query is read in by the User Interface (UI) agent, which is present only
in PE0. A query of the form :A(x, y), B(y, z) is treated as though there were a user
defined relation called QUERY defined by the clause QUERY <- A(x, y), B(y, z). The
UI agent creates an AND process, the root of the AND/OR process tree. In the following
discussion, creation or deletion of a process means creation or deletion of the process
descriptor. It does not mean creation or deletion of a process in the operating systems
sense. Next, the UI agent sends out a SPAWN message to the Network Send (NS)
agent. A SPAWN message is created whenever a user defined literal is to be evaluated.
It includes the PID of the AND process evaluating the conjunction that the relation
call is part of, and a pointer vector containing the call arguments. The NS agent
sends the SPAWN message to the inter-PE communication network. In the current
version of (AMP)2, the network uses a uniformly distributed random number generator to
route the SPAWN message to some random PE. In later versions, the network may
incorporate sophisticated load balancing strategies.

Process  Tree  Growth 

The SPAWN message is routed by the Network Receive (NR) agent of the destination
PE to the Process Tree Grower (PTG), which creates an OR process to search for
a candidate clause. Send/acknowledge protocols are used to record parent child
relationships, which may exist across several PEs.

The PTG traverses the DAG of the relation being evaluated. Recall that each PE
has a copy of the program in its clause memory. On encountering a SEMICOLON
functor during this traversal, the PTG creates an OR process descriptor in the descriptor
memory. The PTG sets the continuation field of this process to the address of the
right subtree of the functor, and then recursively traverses the left subtree. For an OR
process, the continuation indicates which clause(s) to search next, if the guard of the
current clause fails. The continuation field in an OR process descriptor implements the
";" control construct of PARLOG. On encountering a DOT functor during the traversal,
the PTG simply recurses on the left and right subtrees.

Eventually, the PTG encounters a BACKARROW functor. At this point, it is
ready to evaluate the guard of a clause. If the guard is a conjunction of more than one
literal (i.e., the second argument of BACKARROW is either AMPERSAND or
COMMA), it creates an AND process to evaluate the guard. However, if the guard
contains just a single literal, it does not create an AND process for it. This is an
important optimization. It prevents the process tree from growing unnecessarily. The
SPM also features this optimization. In addition, in the SPM, the process tree does not
grow if the guard doesn't suspend (on one way unification), even if it is a conjunction of
more than one literal. The sequential nature of the SPM makes this latter optimization
possible: a conjunction of one way unification calls is actually evaluated one at a
time, and so, if none of the calls suspend, there is no need to create an AND process.
On the other hand, in a parallel architecture, it is desirable to evaluate a conjunction of
one way unification calls concurrently. It is for this reason that the PTG does not perform
the latter optimization: it creates an AND process even if the guard contains only a
conjunction of one way unification calls, none of which may suspend.

In  a  parallel  machine,  the  AND  processes  corresponding  to  guards  of  different 
clauses  in  a  relation,  may  be  created  on  different  PEs.  This  is  how  committed  OR  paral¬ 
lelism  is  exploited. 
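The clause-tree traversal performed by the PTG (a SEMICOLON node records a continuation and descends to the left; a DOT node descends both ways) can be sketched as follows. This Python fragment is a simplified illustration under our own assumptions: clause trees are nested tuples, OrProcess carries only the continuation field, and guard evaluation is reduced to recording which clauses were reached.

```python
from dataclasses import dataclass


@dataclass
class OrProcess:
    """Minimal OR process descriptor: only the continuation is modelled."""
    continuation: object = None


def grow(node, descriptors, started):
    """Traverse a clause tree, collecting descriptors and started guards."""
    tag = node[0]
    if tag == "SEMICOLON":
        _, left, right = node
        # The right subtree is tried only if the guards on the left fail.
        descriptors.append(OrProcess(continuation=right))
        grow(left, descriptors, started)
    elif tag == "DOT":
        _, left, right = node
        # Clauses separated by "." are searched in parallel.
        grow(left, descriptors, started)
        grow(right, descriptors, started)
    elif tag == "BACKARROW":
        started.append(node[1])           # clause head: its guard starts now
    else:
        raise ValueError(f"unexpected node {tag!r} in clause tree")


c1 = ("BACKARROW", "c1", "EMPTY", "EMPTY")
c2 = ("BACKARROW", "c2", "EMPTY", "EMPTY")
c3 = ("BACKARROW", "c3", "EMPTY", "EMPTY")
tree = ("SEMICOLON", ("DOT", c1, c2), c3)

descriptors, started = [], []
grow(tree, descriptors, started)
```

For the relation c1. c2; c3 the guards of c1 and c2 are started together, while c3 is held back as the continuation of the single OR process.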

Prior to evaluating the guard of a clause, the PTG requests the Term Memory
Allocator (TMA) to allocate space for the local variables in the clause. It uses the
addresses returned by the TMA to create a pointer vector, i.e., the binding environment
for the evaluation. It then continues traversal of the relation's DAG. Traversal of the
DAG within a clause is similar to the traversal of the DAG of a relation, except that
AND processes are created, and AMPERSANDs and COMMAs are encountered instead
of SEMICOLONs and DOTs respectively.

Eventually, the PTG encounters a user defined literal, a system primitive, or a
unification related call. To evaluate user defined literals, the PTG sends out a
SPAWN message, which is processed at some random PE as described above. Thus,
user defined literals may be evaluated in different PEs, concurrently. System primitives
do not cause a SPAWN message to be sent. The DATA and VAR system primitives
are evaluated by the Data Checker (DC) and Var Checker (VC) agents respectively.
Their implementation is quite straightforward, and we will not describe it here. The
other primitives are evaluated directly by the AND process that controls the conjunction
the primitives are part of.

In our description of the unification operations in (AMP)2, v, w, x, y, and z
denote variables, nt denotes a non-variable term, and t denotes either a variable or a
non-variable term.

Output  Matching 

v <= t, called output matching, is the means by which PARLOG terms are
created. When the PTG encounters such a call during its traversal, it sends a message
to the PE in whose address space v is resident to evaluate the call. The message
includes the call arguments. The call arguments depend upon the compile time form of
t. If the compile time form of t is a variable, the call arguments are the term memory
addresses of v and t. Otherwise, the call arguments are the term memory address of v
and the clause memory address of t. The message is routed to the Output Matcher
(OM) agent by the NR of the destination PE. If v is not an uninstantiated variable at
the time of the call, the OM flags a run-time error. If, at compile time, t is a variable,
the OM binds v to t. Otherwise, it creates an instance of t and binds v to that
object. After binding v, the OM restarts the computations enqueued on v's demand
list, by sending appropriate messages.

In general, t in v <= t may be in some other PE's address space. For example,
t may be a local variable in some other clause evaluated in the other PE; or it might
have been created via a "<=" call in that clause. Thus, structured terms in (AMP)2 may
be partitioned across several different PEs.

One  Way  Unification 

The distribution of terms across several PEs, coupled with the fact that each PE
has a copy of the program, makes possible a parallel algorithm for one way unification.
When the PTG encounters a one way unification call, say nt <= v, it sends a message
to the PE in whose address space v is resident to evaluate the call. The message
identifies the AND process evaluating the conjunction that the call is part of. Let PID
denote its process descriptor. The message is routed to the Input Matcher (IM) by the
NR of the destination PE. We refer to the left argument of the one way unification
call as the template DAG and the right argument as the term DAG. We refer to the
partitions of the term DAG as term sub-DAGs. Templates, the left arguments of all
one way unification calls, are loaded at compile time into the template memory of each
PE.

The  IM  traverses,  in  lock  step,  the  term  sub-DAG  and  the  template  DAG.  Its 
behavior  on  encountering  a  constant  or  a  structured  term  in  the  template  DAG  is  simi¬ 
lar.  We  describe  its  behavior  for  the  former  case.  There  are  three  possibilities: 

i). The term sub-DAG is a non-variable term or is a variable bound to a non-variable
term. In this case, the IM checks if the non-variable term is a constant equal to
the one in the template DAG. If so, the traversal succeeds; otherwise it fails and
the IM sends a FAIL message to the PE denoted by PID.

ii). The term sub-DAG is a variable, but it is bound to an object in some other PE.
An access fault is said to occur in this situation. On encountering an access fault,
the IM sends a message containing all necessary information to the other PE asking
it to continue the one way unification. It doesn't issue a memory request over
the network to fetch remote data, and wait for it to arrive. This is a very important
optimization: rather than waiting for remote memory requests to be resolved,
the IM sends the computation to where the remote data is, and then starts processing
the next message in its mailbox.

Actually, the IM does not send the continuation message as soon as it detects an
access fault. Instead, it posts an entry in a data structure called the one way
unification state, containing the following information: the template DAG and term
sub-DAG node addresses at which the one way unification is to continue in the
other PE, PID, and the PE where the access fault occurred. This data structure is
stored in the one way unification scratchpad. The reason for posting such an entry
rather than sending the computation right away is that the traversal might fail in
this PE, in which case there is no need to continue the computation. This is
another very important optimization. Because of this optimization, a process will
suspend on remote variable accesses only if the "<=" call cannot succeed or fail
otherwise. Later, if the traversal succeeds, the IM will send the computation out.
There is a speed penalty that we have to pay in this approach, since the one way
unification cannot continue in other PEs until the traversal of the entire sub-DAG
present in this PE has been completed. However, the speed penalty is likely to be
small, since the sub-DAGs are more than likely to be small. Besides, the speed
penalty has to be weighed against the cost of communicating over the network.

During  the  course  of  the  traversal,  several  entries  may  be  posted  in  the  one  way 
unification  state.  The  more  PEs  where  the  unification  is  to  continue,  the  more  the 
parallelism. 

The  one  way  unification  can  be  considered  to  have  succeeded  only  if  the  traversal 
of  every  sub-DAG  is  successful.  The  successful  traversal  of  a  sub-DAG  is  called 
partial  success.  Messages  denoting  partial  success  are  propagated  in  the  direction 
opposite  to  which  the  continuation  messages  are  propagated.  This  is  why  continua¬ 
tion  messages  identify  the  PE  where  the  access  fault  occurred. 

iii).  The  term  sub-DAG  is  an  uninstantiated  variable.  In  this  case,  the  IM  posts  a 
demand  for  the  value  of  this  variable  in  the  one  way  unification  state.  It  doesn’t 
enqueue  the  demand  right  away  on  the  variable’s  demand  list;  later,  if  the  traversal 
succeeds,  it  will  enqueue  the  demand.  Because  of  this  optimization,  a  process  will 
suspend  on  a  variable  only  if  the  <=  call  cannot  succeed  or  fail  otherwise. 
Enqueueing  demands  is  how  suspension  in  one  way  unification  is  implemented  in 
(AMP)2. 

If  a  variable  in  the  template  DAG  is  encountered,  the  IM  posts  an  entry  in  the  one 
way  unification  state  indicating  that  a  binding  for  a  variable  in  nt  has  been  found.  If 
the  traversal  succeeds,  the  IM  sends  out  the  binding  to  the  PE  denoted  by  PID.  The 
variable  would  have  been  allocated  in  the  term  memory  of  that  PE.  When  the  binding 
reaches  that  PE,  the  OM  agent  of  that  PE  binds  the  variable. 
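The deferred handling of access faults and demands in the three cases above can be sketched in Python. The sketch is a simplification under stated assumptions: Remote, Unbound, match and input_match are hypothetical names, terms are tuples or constants, variables in the template are left out, and the one way unification state is a plain list of pending entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Remote:
    """A variable bound to an object held by another PE (access fault)."""
    pe: int
    addr: int


@dataclass(frozen=True)
class Unbound:
    """An uninstantiated variable."""
    name: str


def match(template, term, state):
    """Traverse template and local term sub-DAG in lock step."""
    if isinstance(term, Remote):
        # Do not send anything yet: just note where to continue.
        state.append(("CONTINUE", term.pe, template, term.addr))
        return True
    if isinstance(term, Unbound):
        # Likewise, the demand is only posted, not enqueued.
        state.append(("DEMAND", term.name, template))
        return True
    if isinstance(template, tuple):
        if not (isinstance(term, tuple) and len(term) == len(template)
                and term[0] == template[0]):
            return False
        return all(match(t, s, state) for t, s in zip(template[1:], term[1:]))
    return template == term


def input_match(template, term):
    """Return (outcome, entries to act on) for the local sub-DAG."""
    state = []
    if not match(template, term, state):
        return "FAIL", []                 # posted entries are simply dropped
    if state:
        return "PENDING", state           # now send continuations / demands
    return "SUCCESS", []
```

If a mismatch is found anywhere in the local sub-DAG, the posted entries are discarded and no network traffic is generated, which is the point of deferring the continuation messages.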

Test Unification

When the PTG encounters a test unification call, t1 = t2, it sends a message to the
PE in whose address space t1 resides. The Test Unifier (TU) agent in that PE processes
this message. The evaluation of test unification calls is similar to the evaluation of one
way unification calls. The only difference is that no variables are bound. Also, there is
no notion of a template DAG since the arguments of the call are not known at compile
time. Therefore, access faults can occur on both sub-DAGs during traversal. If an
access fault occurs when traversing the right sub-DAG, the TU posts a demand for the
right sub-DAG node in a data structure called the test unification state, which is stored
in the test unification scratchpad. The demand is basically a remote memory request.
If the traversal succeeds, the TU sends the remote memory request out on the network.
It also sends the context necessary to restart the computation when the remote data
arrives. This is a very important optimization: by sending context along with remote
memory requests, the TU does not have to wait for remote data to arrive; rather, it
starts processing the next message in its mailbox. If an access fault occurs when
traversing the left sub-DAG, the TU posts an entry indicating that the evaluation of
the call is to be continued in some other PE. Later, if the traversal succeeds, the TU
sends continuation messages out.

Process  Tree  Management 

During  the  course  of  the  evaluation  of  a  PARLOG  query,  the  AND/OR  process 
tree  grows  and  shrinks  as  literal  evaluations  (and  conjunction  evaluations,  and  clause 
searches)  succeed  and  fail.  The  PTG  grows  the  AND/OR  process  tree;  the  PTM  prunes 
it.  The  PTM  and  the  PTG  can  asynchronously  access  the  descriptor  memory,  thus 
increasing  parallelism.  However,  concurrent  access  to  a  process  descriptor  must  be 
synchronized.  Therefore,  each  process  descriptor  has  a  lock  field. 

Process tree management is essentially the processing of SUCCESS and FAIL
messages. A SUCCESS (FAIL) message is sent by a process to its parent if it succeeds
(fails). Let CPID denote the succeeding process, PID its parent, and PPID the parent
of the parent. The behavior of the PTM upon receipt of a SUCCESS message depends
upon the type of the PID process and the PPID process:

i). PID and PPID are OR processes. This occurs when a group of nested clauses
separated by "." and ";" is being searched for a candidate clause. The PTM destroys
PID and, recursively, all of PID's children. It then sends a SUCCESS message
to PPID.

ii). PID is an OR process, PPID is an AND process. This occurs when CPID is a
guard that has succeeded, signifying that a candidate clause has been found. The
PTM sends a SPAWN-AND message, which contains PPID and a pointer to the
body of the candidate clause, out on the network. It then destroys PID and,
recursively, all of PID's children, which are the siblings of CPID. There may be
SUCCESS or FAIL messages in transit from the sibling guards. However, these
messages will be ignored when they arrive at their destination, since PID would
have already been destroyed. To guarantee that the destination process is destroyed,
it is necessary to ensure that PIDs are not reused. This presents no problems:
all that is needed is that each PE use unique process numbers for its PIDs.
It can be seen that the PARLOG commit operator, ":", is implemented in (AMP)2
by executing the body of the first clause whose guard terminates successfully. The
graph reduction should also be apparent: the process evaluating a literal is reduced
to another process evaluating the body of the candidate clause.

The SPAWN-AND message is routed by the network to a random PE, just as the
SPAWN message. The SPAWN-AND message is processed by the PTG. The
PTG first checks if the tail forking optimization is applicable. If not, the PTG
creates an AND process and sends a SPAWN-AND-ACK message to the PE
containing PPID. Otherwise, the conjunction of body literals is controlled by
PPID itself, without creating a new AND process. This optimization is a generalization
of the tail recursion optimization in traditional logic languages [Greg85]. In
a sequential implementation of PARLOG, the tail forking optimization applies if
there is a parallel conjunction at the end of a clause. However, (AMP)2 being a
parallel machine, we insist on an additional condition: PPID should be in the
same PE to which the network routed the SPAWN-AND message. In the
absence of the second condition, we might get a single AND process with lots of
child processes. While this is fine in a sequential machine, it leads to an unacceptable
bottleneck in a parallel machine.

iii). PID is an AND process. If PID does not have a parent, i.e., if PID is the root of
the AND/OR process tree, the query evaluation is complete. Otherwise, the PTM
destroys PID and, recursively, all of PID's children. It then sends a SUCCESS
message to PPID.

Failure handling is quite straightforward. If PID is an AND process, the PTM
destroys PID and, recursively, all of PID's children. It then sends a FAIL message to
PPID. If PID is an OR process, the PTM checks if PID has a continuation. If so, it
sends a message to the PTG to initiate the evaluation of the guard of another clause.
If PID does not have a continuation, the PTM destroys PID and, recursively, all of
PID's children, and then sends a FAIL message to PPID.
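
The failure handling rules above, together with the requirement that PIDs never be reused, can be sketched as follows. This is a minimal illustrative model, not the (AMP)2 design itself: the class names, the `outbox` list, and the `EVAL-GUARD` message name are all hypothetical, and the network and PTG are reduced to a list of pending messages.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional


@dataclass
class Process:
    pid: int
    kind: str                                   # "AND" or "OR"
    parent: Optional[int] = None
    children: list = field(default_factory=list)
    continuation: list = field(default_factory=list)  # untried clauses (OR only)


class ProcessTree:
    """Hypothetical stand-in for the process table kept by one PE."""

    def __init__(self):
        self._next_pid = count(1)               # monotone counter: PIDs are never reused
        self.procs = {}
        self.outbox = []                        # (message name, payload) pairs

    def spawn(self, kind, parent=None, continuation=()):
        pid = next(self._next_pid)
        self.procs[pid] = Process(pid, kind, parent, [], list(continuation))
        if parent is not None:
            self.procs[parent].children.append(pid)
        return pid

    def destroy(self, pid):
        """Destroy pid and, recursively, all of its children."""
        proc = self.procs.pop(pid, None)
        if proc is None:
            return
        for child in list(proc.children):
            self.destroy(child)
        if proc.parent in self.procs:           # unlink from a surviving parent
            self.procs[proc.parent].children.remove(pid)

    def on_fail(self, pid):
        """Handle a FAIL message addressed to pid."""
        proc = self.procs.get(pid)
        if proc is None:
            return "ignored"                    # destination already destroyed
        if proc.kind == "OR" and proc.continuation:
            clause = proc.continuation.pop(0)
            self.outbox.append(("EVAL-GUARD", clause))   # hypothetical message to the PTG
            return "retry"
        self.destroy(pid)
        if proc.parent is not None:
            self.outbox.append(("FAIL", proc.parent))
        return "failed"
```

Because the counter only moves forward, a late message addressed to a destroyed PID can never be mistaken for a message to a newer process; it simply falls into the "ignored" branch.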

3.3.  Mapping  (AMP)2  onto  the  Connection  Machine 

In this section, we briefly discuss the feasibility of mapping (AMP)2 onto the
Connection Machine architecture, i.e., of executing PARLOG on the Connection Machine.
As we mentioned at the beginning of this chapter, this mapping is best done after
studying the run-time execution behavior of PARLOG programs, to answer questions
such as: How much parallelism is there? What is the granularity of parallelism? What
are the communication patterns between the agents? What operations are performed
most frequently? Do they require hardware support? If so, what kind? What
architectural features are required for the efficient execution of single-solution PARLOG
programs? Should the PE design be simple or should it be complex? Such a study is a
major effort in itself and beyond the scope of this project. What we present in this
section is a qualitative discussion of the feasibility of executing PARLOG programs on the
Connection Machine.

The Connection Machine (CM) [Hill85] is a fine grained, highly parallel, SIMD
(Single Instruction Multiple Data) computer. It consists of 64K processors, each with 4K
bits of memory and a 1 bit wide ALU. Adding two 16 bit numbers takes 16 machine
cycles. The total memory capacity of the CM is 32 Mbytes.
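
As a check on the figures quoted above, taking K = 2^10 and one byte = 8 bits:

$$64\text{K} \times 4\text{K bits} = 2^{16} \times 2^{12}\ \text{bits} = 2^{28}\ \text{bits} = 2^{25}\ \text{bytes} = 32\ \text{Mbytes}$$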


The  CM  requires  a  front-end  host  computer  to  issue  instructions.  These  instruc¬ 
tions  are  broadcast  to  all  the  64K  processors,  each  of  which  executes  the  same  instruc¬ 
tion,  operating  on  the  contents  of  its  own  memory.  This  concept  is  called  data  level 
parallelism. 

There are two forms of communication in the CM: a single bit wide global-or
network and a 16 dimensional hypercube. The global-or network allows aggregate
operations such as global-minimum, global-maximum, global-or, etc., to be performed
quickly, while the hypercube network allows processors to communicate by exchanging
packets of information.

The form of parallelism in the CM, viz., data level parallelism, is different from the
form of parallelism found in control level parallel processing machines. In the latter,
the programmer is required to divide the program into fragments, one for each
processor. Data level parallel processing works best on problems with large amounts of data,
whereas control level parallel processing works best when the ratio of program to data is
high.

The CM is best suited to applications that have a large amount of data level
parallelism. Examples of such applications include document retrieval from a large
bibliographic database, image processing, VLSI circuit design, and fluid flow problems.

At first glance, it would appear that PARLOG has plenty of data level
parallelism: different parts of a large structured term may be constructed in parallel
by different processes. For example, the elements of the list [e1, e2, ..., en] may be
constructed in parallel by n concurrent processes. However, a PARLOG process constructs
only the top level structure of a variable's binding. That is, it either binds a variable to
a ground term, or, if it binds it to a partially instantiated term, the further
instantiation of this term is done by another PARLOG process. This is a characteristic
feature of stream-AND parallelism. A process partially instantiates a variable and
passes this binding to another process through a shared variable. The second process, in
turn, partially instantiates the variables in this binding, and passes them to a third
process, and so on. Thus, the form of parallelism intrinsic to PARLOG is one where
several different processes act on different portions of a large data structure. In other
words, the form of parallelism intrinsic to PARLOG is control level parallelism with
multiple threads of control.

This form of parallelism is directly opposite to what is best supported by the
CM, namely a single process operating on all portions of a large structure concurrently.
Therefore, we conclude that the Connection Machine, with its SIMD style data level
parallelism, is not a good choice for executing PARLOG. Indeed, our experience in
designing (AMP)2 supports this conclusion. A shared memory abstraction on a coarse
grained, loosely coupled architecture is better suited to implementing PARLOG than
the CM.

3.4.  Conclusions

This chapter presented our design of a parallel abstract machine for executing
PARLOG. The principal conclusion from this work is that shared memory greatly
facilitates implementing PARLOG's stream-AND parallelism and that the key to high
performance stream-AND parallelism is an efficient shared memory abstraction on a loosely
coupled architecture. The abstract machine described in this chapter achieves this via a
number of optimizations. These optimizations address critical problems in the design of
efficient parallel architectures. They address the principal sources of overhead, viz.,
communication and memory latencies, and synchronization overheads.

These problems are overcome because agents never suspend, waiting for remote
data to arrive. If a computation requires access to remote data, the agent sends the
computation over to where the data is, whenever possible (as in one way unification,
and sometimes, in test unification). It then processes the next message in its mailbox.
If a remote memory request cannot be avoided (as sometimes happens during
evaluation of test unification calls), the agent sends a context with the remote memory
request, and then processes the next message in its mailbox. The context enables
restarting the computation when the remote data arrives. Thus, arbitrary latencies in
memory requests can be tolerated in (AMP)2, which is very important for parallel
architectures [Iann83].

Finally, we investigated the feasibility of executing PARLOG programs on the
Connection Machine architecture. The conclusion from this work is that a coarse
grained, loosely coupled architecture is better suited than the CM, since the form of
parallelism best supported by the CM is directly opposite to that found in PARLOG.

CHAPTER  4


Data/Knowledge  Base  Query  Processing  Concepts 


This chapter describes our investigation of the concepts relating to data/knowledge
base (D/KB) query processing. It is organized as follows. Section 4.1 gives several
definitions relating to the composition of a data/knowledge base. Recursive query
processing is a key concept that differentiates D/KB query processing from traditional database
query processing. Section 4.2 introduces definitions relating to recursion. Section 4.3
describes rule representation. Section 4.4 describes top-down and bottom-up evaluation
of rule-based queries. Section 4.5 describes evaluation of non-recursive rule-based
queries. Section 4.6 describes the evaluation of recursive rule-based queries. Section 4.7
describes the concepts pertaining to recursive rule-based query optimization.

4.1.  Data/Knowledge  Base 

We  give  several  definitions  pertaining  to  the  data/knowledge  base.  These 
definitions  appear  in  [Banc86]. 

The data/knowledge base is a set of Horn clauses and schemas. A Horn clause has
the form

head ← body

where head is zero or one atomic formulas (predicates with arguments supplied) and
body is a conjunction of zero or more atomic formulas. All arguments that are
variables are implicitly universally quantified. The logical interpretation of a Horn clause is
that the body implies the head.

Example: a(X, Y) ← b(X, Y), c(Z, Y) means that for all X, Y, and Z, b(X, Y) and
c(Z, Y) implies a(X, Y). []

A  relation  definition  is  the  set  of  clauses  whose  head  refers  to  a  given  relation. 

A Horn clause with an empty body and no variables in its head is called a fact.
Facts can be written with no implication sign, e.g., parent(a, b) means that a is the
parent of b.

A rule is a Horn clause that is not a fact.

Predicates corresponding to facts are called base predicates. Predicates
corresponding to the head of a rule are called derived predicates.

We  can  assume  without  loss  of  generality  that  a  relation  is  defined  entirely  by  rules 
or  entirely  by  facts.  If  a  set  of  clauses  does  not  meet  this  condition,  it  can  easily  be 
transformed  into  a  set  of  clauses  that  does. 

Example:  The  set  of  clauses 

p(X, Y) ← a(X, Z), b(Z, Y).

p(a, b).

p(c, d).

is equivalent to

p(X, Y) ← a(X, Z), b(Z, Y).

p(X, Y) ← p1(X, Y).

p1(a, b).

p1(c, d). []
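
The transformation used in this example can be sketched mechanically. The code below is an illustration under stated assumptions, not a procedure given in the report: atoms are modelled as `(predicate, args)` tuples, variables are strings beginning with an upper-case letter, and the auxiliary predicate name (the original name with `1` appended) is assumed not to clash with an existing predicate.

```python
def is_fact(clause):
    """A fact has an empty body and no variables in its head."""
    (_, args), body = clause
    return not body and not any(a[:1].isupper() for a in args)


def separate(clauses):
    """Rewrite clauses so every relation is defined only by rules or only by facts."""
    by_pred = {}
    for clause in clauses:
        by_pred.setdefault(clause[0][0], []).append(clause)

    result = []
    for pred, group in by_pred.items():
        facts = [c for c in group if is_fact(c)]
        rules = [c for c in group if not is_fact(c)]
        if not (facts and rules):
            result.extend(group)              # already purely rules or purely facts
            continue
        aux = pred + "1"                      # hypothetical fresh name, assumed unused
        arity = len(facts[0][0][1])
        variables = tuple(f"X{i}" for i in range(1, arity + 1))
        result.extend(rules)
        result.append(((pred, variables), [(aux, variables)]))   # p(X1, X2) <- p1(X1, X2)
        result.extend(((aux, args), []) for (_, args), _ in facts)
    return result


kb = [
    (("p", ("X", "Y")), [("a", ("X", "Z")), ("b", ("Z", "Y"))]),
    (("p", ("a", "b")), []),
    (("p", ("c", "d")), []),
]
rewritten = separate(kb)
```

On the three clauses of the example, `rewritten` holds the original rule, one bridging rule from `p` to `p1`, and the two facts restated over `p1`.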

Thus, the data/knowledge base can be partitioned into rule relations and fact relations.
The set of rule relations is called the intensional knowledge base, or rulebase, while the
set of fact relations is called the extensional knowledge base, or database. The
intensional knowledge base contains only derived predicates, while the extensional knowledge
base contains only base predicates.

The  motivation  for  distinguishing  between  the  intensional  and  extensional 
knowledge  bases  is  that  rules  are  stored  in  compiled  form  to  allow  for  more  efficient 
access,  while  facts  are  stored  directly.  We  will  describe  different  storage  structures  for 
storing  the  compiled  form  of  rules  later  in  this  chapter. 

The schema of a base predicate is the same as a relational database schema; it
contains the names of the arguments and their types. The schema of a derived predicate is
derived using the base predicate schema and the rules.

4.2.  Recursion 

In  this  section  we  give  several  definitions  pertaining  to  evaluating  recursive  queries. 
Again,  these  definitions  appear  in  [Banc86]. 

We will use the set of Horn clauses shown in figure 4.1 to illustrate the definitions.
The bi's in this example are base predicates, while p, q, p1, and p2 are derived
predicates.

A  derived  predicate  q  is  reachable  from  a  derived  predicate  p  if 

(i)  q  is  in  the  body  of  a  rule  having  p  as  its  head,  or 

(ii)  q  is  in  the  body  of  a  rule  having  s  as  its  head  and  s  is  reachable  from  p. 

Example: In figure 4.1 p1 is reachable from p. So is b1 because b1 is reachable from p1
and p1 is reachable from p. []

Two  derived  predicates  p  and  q  are  mutually  recursive  if  they  are  reachable  from 
each  other. 

Example:  In  figure  4.1  p  and  q  are  mutually  recursive.  [] 

A  predicate  p  is  recursive  if  it  is  reachable  from  itself. 

R1: p(X, Y) ← p1(X, Z), q(Z, Y).
R2: p(X, Y) ← b3(X, Y).

R3: p1(X, Y) ← b1(X, Z), p1(Z, Y).
R4: p1(X, Y) ← b4(X, Y).

R5: p2(X, Y) ← b2(X, Z), p2(Z, Y).
R6: p2(X, Y) ← b5(X, Y).

R7: q(X, Y) ← p(X, Z), p2(Z, Y).


Figure  4.1:  Sample  data/knowledge  base 

Example: In figure 4.1 p1 is a recursive predicate. []

A rule p ← p1, p2, ..., pn is called a recursive rule if there exists a pi in the body
that is mutually recursive to p.

Example: In figure 4.1 R3 is a recursive rule. []

Two rules are mutually recursive if the predicates in their heads are mutually
recursive.

Example: In figure 4.1 R1 and R7 are mutually recursive rules. []

A recursive rule is called a linear recursive rule if there is only one predicate in the
body that is mutually recursive to the head. A recursive rule that is not linear is called
a nonlinear recursive rule.

Example: All the recursive rules in figure 4.1 are linear. However, a rule of the form
p ← p1, p2, r, in which both p1 and p2 are mutually recursive to p, is nonlinear. []

It can be easily shown that mutual recursion is an equivalence relation on the set of
derived predicates and the set of rules. Mutual recursion partitions the set of derived
predicates into disjoint blocks of mutually recursive predicates.

Example: {p, q}, {p1}, and {p2} are the disjoint blocks of mutually recursive
predicates in figure 4.1. []
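
The definitions of this section can be checked mechanically against figure 4.1. The sketch below is illustrative only: it encodes each rule of the figure as a head predicate and a list of body predicates (arguments are dropped because the definitions do not use them), and the function names are our own.

```python
# Rules of figure 4.1, reduced to (head predicate, body predicates).
RULES = {
    "R1": ("p",  ["p1", "q"]),
    "R2": ("p",  ["b3"]),
    "R3": ("p1", ["b1", "p1"]),
    "R4": ("p1", ["b4"]),
    "R5": ("p2", ["b2", "p2"]),
    "R6": ("p2", ["b5"]),
    "R7": ("q",  ["p", "p2"]),
}
DERIVED = {head for head, _ in RULES.values()}


def reachable(p):
    """All predicates reachable from p through rule bodies."""
    seen, stack = set(), [p]
    while stack:
        cur = stack.pop()
        for head, body in RULES.values():
            if head == cur:
                for q in body:
                    if q not in seen:
                        seen.add(q)
                        stack.append(q)
    return seen


def mutually_recursive(p, q):
    return q in reachable(p) and p in reachable(q)


def blocks():
    """Disjoint blocks of mutually recursive derived predicates."""
    result = []
    for p in sorted(DERIVED):
        block = {p} | {q for q in DERIVED if mutually_recursive(p, q)}
        if block not in result:
            result.append(block)
    return result


def is_recursive_rule(name):
    head, body = RULES[name]
    return any(mutually_recursive(head, q) for q in body)


def is_linear(name):
    head, body = RULES[name]
    return sum(mutually_recursive(head, q) for q in body) == 1
```

Grouping the rule names by the block of their head predicate gives the rule partitions: the rules needed to evaluate each block as a whole.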

The  predicates  in  a  block  must  be  evaluated  as  a  whole.  Mutual  recursion  groups 
together  rules  needed  to  evaluate  the  predicates  in  a  block. 

Example: The rule partitions for figure 4.1 are {R1, R2, R7}, {R3, R4}, and {R5, R6}. []

4.3.  Rule  Representation

Rules are typically represented in a graph formalism. Predicate Connection
Graphs (PCGs) are an example of such a formalism. PCGs were proposed by McKay
and Shapiro [McKa81] as a representation to facilitate reasoning with recursive rules. A
PCG is a directed graph that represents the relationships between head predicates and
body predicates. Each node in the PCG represents a predicate. Edges arise from rules.
If there is a rule of the form p ← p1, p2, ..., pn, there is a directed edge from p to each
of the pi's. Figure 4.2 shows the PCG for the data/knowledge base of figure 4.1.


Figure  4.2.  Predicate  Connection  Graph  for  figure  4.1 

The  definitions  of  the  previous  section  can  be  recast  in  terms  of  the  PCG.  The  blocks 
of  mutually  recursive  predicates  are  the  strongly  connected  components  of  the  PCG.  A 
strongly  connected  component,  also  called  a  clique ,  of  a  graph  is  a  set  of  nodes  such 
that  there  is  a  directed  path  between  each  pair  of  nodes. 


In the context of D/KB query processing, we will use a somewhat broader
definition of a clique. Here, by clique we will mean a set of mutually recursive
predicates as well as the rules needed to evaluate these predicates. Obviously, some of these
rules will be recursive. The rest are called exit rules. Exit rules ensure that the
evaluation of a recursive predicate terminates. Therefore, every clique must have at least one
exit rule.

Example: In figure 4.1 we need rules R3 and R4 to evaluate p1. R3 is a recursive rule,
while R4 is an exit rule. []

Figure  4.3  shows  the  cliques  for  the  sample  D/KB  shown  in  figure  4.1. 
Figure  4.3:  Cliques  for  figure  4.1


4.4.  Top-Down  vs.  Bottom-Up  Evaluation 

There  are  essentially  two  strategies  for  evaluating  Horn  clause  queries  —  top-down 
and  bottom-up  [Banc86].  We  will  illustrate  these  strategies  via  an  example.  Consider 
the  following  rules 

R1: r(X, Y) ← s(X, Z), b1(Z, Y).

R2: r(X, Y) ← b2(X, Y).

R3: s(X, Y) ← b3(X, Y).

R4: query(X) ← r(X, "a").

Here, the bi's are base predicates and r, s, and query are derived predicates. Suppose we
want to evaluate the predicate query. In bottom-up evaluation, we start with the base
predicates in the body of rules and keep combining them with other predicates in the
body to produce the head predicates. We stop when query is generated. For the above
rules, we get s from b3. We combine s with b1 to get a partial result for r. We get
another partial result from b2. Finally, we take the union of these partial results to get
all the values of r. We then apply a selection on r to get query, i.e., the values of X.

In top-down evaluation, we start with the predicate to be evaluated and keep
evaluating predicates in the body of rules defining this predicate. During this process,
we propagate information until we reach the base predicates. For the above rules,
evaluating query means evaluating r with its second argument bound to "a". We use
rule R1 to propagate this binding to b1 and R2 to propagate it to b2. From b2, we get
a partial result for X. From b1, we get values for Z, which produce bindings for the
second argument of s. These bindings are propagated using rule R3 to b3, which
produces another partial result. The union of these partial results then gives all the values
of X.

Bottom-up  strategies  are  simpler  and  easy  to  implement,  but  they  compute  a  lot  of 
useless  results,  since  they  do  not  use  knowledge  about  the  query  to  restrict  the  search 
space.  On  the  other  hand,  top-down  strategies  are  more  efficient  since  they  use 
knowledge  about  the  query  to  propagate  information,  but  they  are  more  complex  and 
harder  to  implement. 
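
The trade-off can be made concrete on rules R1 to R4 above. The sketch below is illustrative: the contents of the base relations `b1`, `b2`, and `b3` are invented sample data, and each function returns, alongside the answer, the number of intermediate tuples it produced so that the two strategies can be compared.

```python
# Invented sample base relations (sets of 2-tuples).
b1 = {(2, "a"), (3, "z")}
b2 = {(1, "a"), (4, "c")}
b3 = {(5, 2), (6, 3), (7, 9)}


def bottom_up():
    """Compute all of s and r, then select on the query constant."""
    s = set(b3)                                                   # R3
    r = {(x, y) for (x, z) in s for (z2, y) in b1 if z == z2}     # R1
    r |= b2                                                       # R2
    answer = {x for (x, y) in r if y == "a"}                      # R4
    return answer, len(r)


def top_down(const="a"):
    """Propagate the binding Y = const down to the base relations."""
    from_b2 = {x for (x, y) in b2 if y == const}                  # R2 with Y bound
    zs = {z for (z, y) in b1 if y == const}                       # R1: Y bound in b1 gives Z
    from_s = {x for (x, z) in b3 if z in zs}                      # R3: Z bound in s
    return from_b2 | from_s, len(from_b2) + len(from_s)


bu_answer, bu_work = bottom_up()
td_answer, td_work = top_down()
```

Both strategies return the same answer set, but on this data the bottom-up version materializes four tuples of r, two of which do not contribute to the query, while the top-down version only ever produces the two tuples that do.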

Several  optimization  strategies  have  been  proposed  for  use  with  bottom-up  stra¬ 
tegies  [Beer86,  Banc86 . Banc86,  Sacc86,  Sacc86 . ].  These  strategies  utilize  query 

information  to  restrict  the  search  space,  while  at  the  same  time  enjoying  the  ease  of 
implementation  advantage  of  bottom-up  strategies.  Therefore,  we  will  focus  only  on 
bottom-up  strategies. 

4.5.  Evaluating  Nonrecursive  Predicates 

Bottom-up  evaluation  of  a  nonrecursive  predicate  can  be  achieved  by  a  straightfor¬ 
ward  compilation  to  relational  algebra.  For  example,  evaluating  the  predicate  r  defined 
by  the  rules 

r(X, Y) ← s(X, Z), b(Z, Y).
r(X, Y) ← e(X, Y).

is equivalent to evaluating the following relational algebra expression

$$e \cup \pi_{s.1,\,b.2}\,(s \bowtie_{s.2 = b.1} b)$$
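
A minimal sketch of this compilation, assuming relations are modelled as Python sets of tuples and columns are numbered from 0 (so the join condition s.2 = b.1 becomes column 1 of s against column 0 of b). The helper names `join` and `project` are our own.

```python
def join(left, right, l_col, r_col):
    """Equi-join: concatenate tuples whose join columns agree."""
    return {lt + rt for lt in left for rt in right if lt[l_col] == rt[r_col]}


def project(rel, cols):
    """Keep only the listed columns, eliminating duplicates."""
    return {tuple(t[c] for c in cols) for t in rel}


def evaluate_r(s, b, e):
    """r = e UNION project(s JOIN b) for r(X,Y) <- s(X,Z), b(Z,Y) and r(X,Y) <- e(X,Y)."""
    return e | project(join(s, b, 1, 0), (0, 3))
```

For instance, with s = {(1, 2)}, b = {(2, 3)} and e = {(9, 9)}, the join contributes (1, 3) and the union adds (9, 9).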

4.6.  Evaluating  Recursive  and  Mutually  Recursive  Predicates 

Bottom-up evaluation of a recursive predicate involves computing the least fixed
point (LFP) of a recursive equation [Emde76]. For example, evaluating the recursive
predicate p1 in figure 4.1, which is defined by the Horn clauses

p1(X, Y) ← b1(X, Z), p1(Z, Y).
p1(X, Y) ← b4(X, Y).

corresponds to evaluating the LFP of the following recursive equation

$$p_1 = b_4 \cup \pi_{b_1.1,\,p_1.2}\,(b_1 \bowtie_{b_1.2 = p_1.1} p_1) = b_4 \cup b_1 \circ p_1 = f(p_1)$$

where the operator $\circ$ is called the composition operator, which is a join followed by a
projection on the non-join columns.

As defined by Aho and Ullman [Aho79], a least fixed point of the equation $p_1 = f(p_1)$ is a
relation $p_1^*$ that satisfies

(i) $p_1^* = f(p_1^*)$ and

(ii) if $p_1$ is a relation such that $p_1 = f(p_1)$, then $p_1^* \subseteq p_1$.

In general, a recursive equation may not have a least fixed point. However, if the
function $f$ is monotone in the sense that

if $p_1 \subseteq p_2$ then $f(p_1) \subseteq f(p_2)$

it is guaranteed to have a least fixed point [Tars55].

The  nice  property  about  Horn  clauses  is  that  the  function  /  consists  only  of  union 
and  composition  operators  and  is  therefore  monotone.  This  means  that  recursive  rules 
are  guaranteed  to  have  a  least  fixed  point1. 

Algorithm 1 (see [Aho79, Banc86]) is a natural way to compute the LFP of the
equation $r = f(r)$.

    $r^0 = \phi$;
    repeat (for $j = 0, 1, 2, \ldots$)
        $r^{j+1} = f(r^j)$
    until ($r^{j+1} - r^j = \phi$);

Algorithm 1: Naive Evaluation of a Recursive Equation

This method is called naive evaluation [Banc86]. It is naive because the entire relation $r^j$ is
used to compute $r^{j+1}$ even though the monotonicity of $f$ and $r^0$ being equal to $\phi$
ensure that $r^j \subseteq r^{j+1}$, i.e., tuples of $r^j$ are also present in $r^{j+1}$.
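
Naive evaluation is short enough to sketch directly. The code below assumes binary relations stored as Python sets of pairs; `compose` implements the composition operator (join on the shared column, then project it away), and the sample contents of `b1` and `b4` are invented for illustration.

```python
def compose(a, b):
    """Composition: join a.2 = b.1, then project on the non-join columns."""
    return {(x, y) for (x, z) in a for (z2, y) in b if z == z2}


def naive_lfp(f):
    """Iterate r := f(r) from the empty relation until no new tuples appear."""
    r = set()
    iterations = 0
    while True:
        r_next = f(r)
        iterations += 1
        if not (r_next - r):          # termination test is a set difference
            return r, iterations
        r = r_next


# Invented sample data for p1 = b4 UNION (b1 composed with p1).
b1 = {(1, 2), (2, 3), (3, 4)}
b4 = {(4, 5)}
p1, steps = naive_lfp(lambda r: b4 | compose(b1, r))
```

Each call to `f` recomputes the join against every tuple found so far, including tuples that were already joined in earlier iterations; this repeated work is what the term naive refers to.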

A more efficient procedure is to compute only the difference between $r^{j+1}$ and $r^j$
during each iteration. This procedure is called semi-naive evaluation [Banc85] (see
algorithm 2 below).

    $j = 0$;
    $r^0 = \phi$;
    $\delta r^0 = f(\phi)$;
    while ($\delta r^j \neq \phi$) do {

Algorithm 2: Semi-naive Evaluation of a Recursive Equation
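
The following sketch shows semi-naive evaluation in its common textbook form, specialized to the linear equation p1 = b4 ∪ b1 ∘ p1. It is not a reconstruction of the loop body of Algorithm 2: it relies on the assumption that, for a linear rule, composition distributes over union, so only the tuples discovered in the previous iteration need to be joined again. Relation contents are invented sample data.

```python
def compose(a, b):
    """Composition: join a.2 = b.1, then project on the non-join columns."""
    return {(x, y) for (x, z) in a for (z2, y) in b if z == z2}


def semi_naive_linear(b1, b4):
    """LFP of r = b4 UNION (b1 composed with r), joining only new tuples."""
    r = set()
    delta = set(b4)                   # first difference: f(empty) = b4
    while delta:
        r |= delta                    # accumulate what is new
        delta = compose(b1, delta) - r   # join only the last delta, drop known tuples
    return r


chain = semi_naive_linear({(1, 2), (2, 3), (3, 4)}, {(4, 5)})
cyclic = semi_naive_linear({(1, 2), (2, 1)}, {(1, 1)})
```

The result is the same relation that naive evaluation produces, and the loop also terminates on cyclic data because tuples already in `r` are removed from each new difference.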

As  we  mentioned  before,  a  set  of  mutually  recursive  predicates  must  be  solved  together 
as  a  whole.  This  involves  finding  the  LFP  of  a  set  of  recursive  equations. 

Example: Evaluating p and q in figure 4.1 involves finding the LFP of the following
recursive equations

$$p = b_3 \cup p_1 \circ q = f_1(p, q)$$

$$q = p \circ p_2 = f_2(p, q)$$ []

In general, evaluating a set of mutually recursive predicates $r_1, \ldots, r_n$ will involve
finding the LFP of a set of recursive equations of the form

$$r_1 = f_1(r_1, \ldots, r_n)$$

$$\vdots$$

$$r_n = f_n(r_1, \ldots, r_n)$$

The LFP is guaranteed to exist since the functions $f_i$ are all monotone in the case of
Horn clauses.
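
A sketch of naive evaluation for such a system, applied to the equations for p and q in figure 4.1. It assumes that the lower blocks have already been evaluated, so the relations for p1 and p2 are available as inputs; their contents, and those of b3, are invented sample data. All equations are evaluated against the relations of the previous iteration.

```python
def compose(a, b):
    """Composition: join a.2 = b.1, then project on the non-join columns."""
    return {(x, y) for (x, z) in a for (z2, y) in b if z == z2}


def naive_lfp_system(fs):
    """fs maps each relation name to a function of the dict of current relations."""
    rels = {name: set() for name in fs}
    while True:
        new = {name: f(rels) for name, f in fs.items()}
        if all(not (new[name] - rels[name]) for name in fs):
            return rels               # no equation produced a new tuple
        rels = new


# Invented sample data; p1_rel and p2_rel stand for already evaluated blocks.
b3 = {(1, 2)}
p1_rel = {(0, 1)}
p2_rel = {(2, 1)}

solution = naive_lfp_system({
    "p": lambda r: b3 | compose(p1_rel, r["q"]),   # p = b3 UNION p1 o q
    "q": lambda r: compose(r["p"], p2_rel),        # q = p o p2
})
```

Here p first receives (1, 2) from b3, q then receives (1, 1), and that tuple in turn feeds back into p through p1, which is why the two predicates have to be solved together.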

Algorithm  3  shows  the  naive  evaluation  procedure  for  this  set  of  equations,  while 
algorithm  4  shows  the  semi-naive  evaluation  procedure. 

    $r_i^0 = \phi$ for $i = 1, \ldots, n$;
    repeat (for $j = 0, 1, 2, \ldots$)
        $r_1^{j+1} = f_1(r_1^j, \ldots, r_n^j)$;
        ...
        $r_n^{j+1} = f_n(r_1^j, \ldots, r_n^j)$;
    until ($r_i^{j+1} - r_i^j = \phi$) for $i = 1, \ldots, n$;

Algorithm 3: Naive Evaluation of a Set of Recursive Equations

Algorithms 1-4 are basically relational algebra programs that execute against the
set of base relations. That is, they compute the tuples of the derived relations given the
base relations. The efficiency of these programs is strongly dependent upon the interface
to the DBMS. If the DBMS interface is relational algebra, the above algorithms must
be executed as application programs since relational algebra cannot express least fixed
point queries [Aho79]. During each iteration several temporary tables would have to be
created and dropped. Also, checking for termination of the iteration involves set
difference, a costly operation. (In the next chapter, we describe naive and semi-naive
LFP evaluation algorithms with relational algebra as the DBMS interface. We have
used relational algebra as the DBMS interface in the VLPDF demonstration testbed,
since this testbed is built on top of an existing relational DBMS.)

On the other hand, if the DBMS interface allows expressing the above system of
recursive equations, the DBMS can better optimize the LFP computation, avoiding these
overheads. Also, the DBMS may be able to optimize certain forms of these recursive
equations (e.g., transitive closure) better than others. The issues then are: What are
these forms? What is the best way of implementing them? Which of these forms do we
include in the DBMS interface? How should this be done? These issues significantly
affect the efficiency of LFP computation, and thereby D/KB query processing
performance. Therefore, the DBMS interface is a very critical design parameter for the
D/KBMS architecture.

Another way of improving the efficiency of LFP computation is to restrict the
search space, i.e., to select only those tuples of the relations $r_i$, $i = 1, 2, \ldots, n$, that are
needed for the computation. In the next section, we discuss various optimization
techniques that have been proposed to restrict the search space.

Algorithm 4: Semi-naive Evaluation of a Set of Recursive Equations

4.7.  Optimization 

In this section, we discuss optimization strategies meant for use with bottom-up
evaluation. These strategies improve the efficiency of LFP computation. Several
strategies have been proposed, e.g., magic sets [Banc86...], supplementary magic sets
[Sacc86], counting and supplementary counting [Sacc86...]. The main idea behind
these strategies is the use of sideways information passing to restrict the computation to
tuples that are related to the query. Beeri and Ramakrishnan [Beer86] have developed a
uniform framework to describe and compare these strategies and to understand the
basic ideas that are common to them. They first formalize the notion of sideways
information passing. Then they describe four strategies in terms of this formalism. They
call these strategies generalized magic sets, generalized supplementary magic sets,
generalized counting, and generalized supplementary counting. We will describe their
formalism and the generalized magic sets strategy later in this section. But first we will
give a flavor of sideways information passing and optimization.

Given  bindings  for  some  variables  of  a  predicate,  we  can  evaluate  the  predicate 
with  these  bindings.  This  evaluation  generates  bindings  for  the  other  variables  of  the 
predicate.  These  new  bindings  can  be  passed  to  another  predicate  in  the  same  rule  to 
restrict  the  computation  for  that  predicate. 

Example: Given the rules

R1: r(X, Y) ← s(X, Z), b1(Z, Y).

R2: r(X, Y) ← b2(X, Y).

R3: s(X, Y) ← b3(X, Y).

the query

R4: query(X) ← r(X, "a").

gives a binding for the second argument of r. Since the second argument of r is the
same as the second argument of b1, this binding restricts the computation of b1. After
evaluating b1, we get bindings for Z, which in turn can be passed to s to restrict the
computation of s. []

In  terms  of  relational  algebra,  sideways  information  passing  corresponds  to  pushing 
selections  down  a  relational  algebra  tree. 

Example: Evaluating the above query corresponds to evaluating the following
relational algebra expression

$$\pi_1\,\sigma_{2=\text{"a"}}\,\bigl(b_2 \cup \pi_{1,4}(b_3 \bowtie_{2=3} b_1)\bigr)$$

Use of sideways information passing corresponds to evaluating

$$\pi_1\,\bigl(\sigma_{2=\text{"a"}}\,b_2 \cup \pi_{1,4}(b_3 \bowtie_{2=3} (\sigma_{2=\text{"a"}}\,b_1))\bigr)$$

where the $\sigma_{2=\text{"a"}}$ selection has been pushed down to restrict the number of tuples of b2
and b1. []

As  seen  in  the  above  example,  sideways  information  passing  is  easily  accomplished 
for  nonrecursive  rules.  However,  for  recursive  rules  the  situation  is  more  complicated. 
We  cannot  simply  push  the  selection  down  since  we  run  the  risk  of  losing  result  tuples 
during  the  LFP  computation.  Consider  the  following  data/knowledge  base  and  query 


ancestor(X, Y) ← parent(X, Y).

ancestor(X, Y) ← parent(X, Z), ancestor(Z, Y).

parent("john", "jack"), parent("john", "mary"),
parent("jack", "evan"), parent("jack", "ellen"),
parent("mary", "brian"), parent("mary", "ann"),
parent("joe", "charles"), parent("joe", "diana"),
parent("charles", "ben"), parent("charles", "jan")

query(X) ← ancestor("john", X).

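Evaluating this query is equivalent to evaluating

$$\sigma_{1=\text{"john"}}\,\mathrm{LFP}\bigl(\mathit{ancestor} = \mathit{parent} \cup \pi_{1,4}(\mathit{parent} \bowtie_{2=3} \mathit{ancestor})\bigr)$$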

The  LFP  computation  will  yield  the  following  tuples  for  ancestor. 

ancestor("john", "jack"), ancestor("john", "mary"), ancestor("john", "evan"),
ancestor("john", "ellen"), ancestor("john", "brian"), ancestor("john", "ann")

ancestor("jack", "evan"), ancestor("jack", "ellen")
ancestor("mary", "brian"), ancestor("mary", "ann")

ancestor("joe", "charles"), ancestor("joe", "diana"), ancestor("joe", "ben"),
ancestor("joe", "jan"), ancestor("charles", "ben"), ancestor("charles", "jan")

Applying the selection $\sigma_{1=\text{"john"}}$ to this relation then yields the following results for the
query

ancestor("john", "jack"), ancestor("john", "mary"), ancestor("john", "evan"),
ancestor("john", "ellen"), ancestor("john", "brian"), ancestor("john", "ann")

Let us see what happens if we try to push the $\sigma_{1=\text{"john"}}$ selection through the LFP
operation down to the parent relation. We would then be evaluating the following LFP
expression

$$\mathrm{LFP}\bigl(\mathit{ancestor} = \sigma_{1=\text{"john"}}\,\mathit{parent} \cup \pi_{1,4}((\sigma_{1=\text{"john"}}\,\mathit{parent}) \bowtie_{2=3} \mathit{ancestor})\bigr)$$

This LFP computation yields the following tuples

ancestor("john", "jack"), ancestor("john", "mary")


which  is  obviously  not  the  correct  answer  to  the  query. 
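
Both computations can be replayed on the parent facts listed above. The sketch below models relations as Python sets of pairs; the helper names `compose` and `ancestor_lfp` are our own, and the LFP is computed by simple naive iteration.

```python
def compose(a, b):
    """Join a.2 = b.1, then project on the non-join columns."""
    return {(x, y) for (x, z) in a for (z2, y) in b if z == z2}


def ancestor_lfp(parent):
    """LFP of ancestor = parent UNION (parent composed with ancestor)."""
    ancestor = set()
    while True:
        new = parent | compose(parent, ancestor)
        if new == ancestor:
            return ancestor
        ancestor = new


parent = {
    ("john", "jack"), ("john", "mary"),
    ("jack", "evan"), ("jack", "ellen"),
    ("mary", "brian"), ("mary", "ann"),
    ("joe", "charles"), ("joe", "diana"),
    ("charles", "ben"), ("charles", "jan"),
}

# Correct: compute the full LFP, then select on the first column.
full = ancestor_lfp(parent)
correct = {t for t in full if t[0] == "john"}

# Incorrect: push the selection onto parent before the LFP.
selected_parent = {t for t in parent if t[0] == "john"}
pushed = ancestor_lfp(selected_parent)
```

The full LFP contains 16 tuples, of which 6 have "john" in the first column. With the selection pushed down, the restricted parent relation no longer contains the tuples for "jack" and "mary", so the join never finds john's grandchildren and only two tuples are produced.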

The source of the problem is that selection commutes across a Cartesian product
A × B only if it applies either entirely to A or entirely to B. Therefore, we cannot
apply the selection to parent prior to evaluating the join of parent and ancestor.
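
The effect can be reproduced with a small sketch of naive LFP evaluation. This is an illustration only, not the evaluator used in this work: the names PARENT, select_john, and lfp_ancestor are hypothetical, and the two parent facts for "john" are assumed from the ancestor tuples listed above.

```python
# Parent facts of the running example. The two facts for "john" are an
# assumption inferred from the ancestor tuples listed in the text.
PARENT = {
    ("john", "jack"), ("john", "mary"),
    ("jack", "evan"), ("jack", "ellen"),
    ("mary", "brian"), ("mary", "ann"),
    ("joe", "charles"), ("joe", "diana"),
    ("charles", "ben"), ("charles", "jan"),
}


def select_john(relation):
    """Selection on column 1: keep tuples whose first column is "john"."""
    return {t for t in relation if t[0] == "john"}


def lfp_ancestor(base, join_side):
    """Naive LFP of: ancestor = base U pi_{1,4}(join_side join_{2=3} ancestor)."""
    ancestor = set()
    while True:
        joined = {(x, y) for (x, z) in join_side
                  for (z2, y) in ancestor if z == z2}
        new = base | joined
        if new == ancestor:          # fixpoint reached
            return ancestor
        ancestor = new


full = lfp_ancestor(PARENT, PARENT)
# Correct: select after the LFP has been computed.
correct = select_john(full)
# Incorrect: selection pushed onto both occurrences of parent.
pushed = lfp_ancestor(select_john(PARENT), select_john(PARENT))

print(sorted(y for _, y in correct))
print(sorted(y for _, y in pushed))
```

With the selection pushed down, the join never sees the parent tuples for "jack" and "mary", so the second-generation descendants are lost.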

What  we  need  is  a  way  of  determining  prior  to  the  LFP  computation  all  the 
relevant  tuples  of  the  parent  relation  that  will  be  needed.  Let  us  first  introduce  some 
definitions  to  clarify  the  notion  of  relevant  facts.  These  definitions  are  from  [Banc86]. 
A  fact  p(a )  is  relevant  to  a  query  q  if  p(a)  is  reachable  from  q(b)  for  some  b  in  the 
answer  set.  A  sufficient  set  of  relevant  facts  for  a  query  is  a  set  of  facts  such  that 
replacing  the  extensional  knowledge  base  with  this  set  of  facts  gives  the  same  answer  to 
the  query.  A  set  of  potentially  relevant  facts  is  a  superset  of  the  set  of  relevant  facts.  A 
set  of  potentially  relevant  facts  is  valid  if  it  contains  a  sufficient  set  of  relevant  facts. 

In general, it is impossible to find all the relevant facts for a query without expending
as much effort as is needed to evaluate the query itself. The optimization strategies
for  bottom-up  evaluation,  therefore,  compute  only  a  valid  set  of  potentially  relevant 
facts.  The  major  metric  for  evaluating  an  optimization  strategy  is  the  difference 
between  the  size  of  this  set  and  that  of  the  sufficient  set  of  relevant  facts  contained  in 
it. 

The  optimization  strategies  mentioned  at  the  beginning  of  this  section  are  all  rule 
rewriting  strategies.  The  new  set  of  rules  is  equivalent  to  the  original  set  but  its  LFP 
computation is more efficient. As we mentioned before, Beeri and Ramakrishnan [Beer86]
have developed a unified framework for describing and comparing these strategies. They
have also developed four strategies in terms of this framework: generalized magic sets,
generalized supplementary magic sets, generalized counting, and generalized
supplementary counting. Here, we will describe the generalized magic sets strategy. But
first, we need to describe the formal meaning of sideways information passing.

4.7.1.  Sideways  Information  Passing 

Following [Beer86], let r be a rule with head predicate h. If a predicate occurs
more than once in the body of r, we number its occurrences. Let P(r) denote the set
that contains the head predicate and the predicates occurring in the body of r.

A sideways information passing strategy, called a sip, for a rule r is a labeled
directed acyclic graph where:

(i) Each node is either a member of P(r) or a subset of P(r).

(ii) Each arc is of the form N → p, where N is a subset of P(r) and p is a member of
P(r). Arcs are such that there are no cycles in the sip.

(iii) Each arc has a label χ, which is a set of variables each of which appears in some
member of N.

Since the sip is acyclic, there exists a total ordering of the predicates in P(r) such that
for each arc, all members of its tail appear before its head. The predicates in the rule
are evaluated according to this total order. The evaluation is done as follows. For each
arc N → p with label χ entering p, we compute the join of the predicates in N (some
arguments of these predicates may be bound to constants). The resulting values for the
variables in χ are passed to the predicate p and are used to restrict its computation.

Example: 

R1: ancestor(X, Y) :- parent(X, Y).

R2: ancestor(X, Y) :- parent(X, Z), ancestor.1(Z, Y).

query(X) :- ancestor("john", X).

We have numbered the occurrence of ancestor in the body of R2 to distinguish it from
the head. The natural way to use R2 is to evaluate predicates in the indicated order,
passing bindings from one predicate to another. This strategy can be represented by
the following sip

{ancestor, parent} → ancestor.1,  χ = Z  []

Example:

R1: sg(X, Y) :- flat(X, Y).

R2: sg(X, Y) :- up(X, Z1), sg.1(Z1, Z2), flat(Z2, Z3), sg.2(Z3, Z4), down(Z4, Y).

query(X) :- sg("john", X).

Here also it is natural to evaluate the predicates in R2 in the indicated order. We show
two possible sips below.

sip1: {sg, up} → sg.1,  χ = Z1

sip2: {sg, up} → sg.1,  χ = Z1

      {sg, up, sg.1, flat} → sg.2,  χ = Z3

In sip1, values of Z1 are used to restrict the evaluation of sg.1. In sip2, in addition,
values of Z3 are used to restrict the evaluation of sg.2. []
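
A sip can be held as a plain list of arcs and checked against the definition. The sketch below is an illustration under assumptions: the names RULE_VARS, SIP2, and sip_order are hypothetical, and acyclicity is tested by treating every member of an arc's tail as a predecessor of the arc's head, which is a simplification of the node structure defined above.

```python
from graphlib import TopologicalSorter

# Variables of each predicate occurrence in R2 of the sg example.
RULE_VARS = {
    "sg": {"X", "Y"},          # head of the rule
    "up": {"X", "Z1"},
    "sg.1": {"Z1", "Z2"},
    "flat": {"Z2", "Z3"},
    "sg.2": {"Z3", "Z4"},
    "down": {"Z4", "Y"},
}

# sip2 from the example: (tail N, predicate p, label chi) for each arc N -> p.
SIP2 = [
    (frozenset({"sg", "up"}), "sg.1", {"Z1"}),
    (frozenset({"sg", "up", "sg.1", "flat"}), "sg.2", {"Z3"}),
]


def sip_order(arcs, rule_vars):
    """Check labels and return a total order compatible with the sip.

    Raises ValueError if a label variable does not occur in the tail;
    graphlib.CycleError (a ValueError subclass) if the sip is cyclic.
    """
    predecessors = {p: set() for p in rule_vars}
    for tail, target, label in arcs:
        tail_vars = set().union(*(rule_vars[q] for q in tail))
        if not label <= tail_vars:
            raise ValueError(f"label {label} not covered by tail {set(tail)}")
        predecessors[target] |= tail
    return list(TopologicalSorter(predecessors).static_order())


print(sip_order(SIP2, RULE_VARS))
```

Any order returned places every member of an arc's tail before the predicate the arc enters, which is the evaluation order required by the definition.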

Beeri  and  Ramakrishnan  in  their  paper  just  describe  what  a  sip  is.  They  do  not 
present  an  algorithm  for  generating  a  sip,  given  a  rule  and  a  query.  We  have  developed 
such  an  algorithm,  which  we  will  describe  in  the  next  section  as  part  of  the  adorned  rule 
set  generation  algorithm. 

4.7.2.  Adorned  Rule  Set 

The  first  step  in  rewriting  the  set  of  rules  into  a  more  efficient  form  is  to  generate 
the  adorned  rule  set.  To  understand  this  step,  we  need  to  introduce  some  definitions. 
An adornment of a predicate with arity n is a sequence of length n of b's and f's
[Ullm85]. The adornment indicates which arguments are to be considered as bound
during the evaluation of a predicate and which will get values as a result of the
evaluation. The bound arguments are denoted by b, while those that will get values as a
result of the evaluation are denoted by f in the adornment.

Example:  The  sequence  fbbf  is  an  adornment  of  a  predicate  with  arity  4.  [] 

Example: The adornment fbbf indicates that the second and third arguments are to
be considered bound and that the first and fourth arguments will get values as a result
of the evaluation. []

An adorned predicate is a predicate augmented with its adornment, e.g., p^fbbf. An
adorned rule is a rule where (1) the head predicate is adorned, and (2) the derived
predicates in the body are adorned (thus, only derived predicates have an adorned version).

Example: p^bf(X, Y) :- b(X, Z), s^fb(Y, Z), where b is a base predicate and s a
derived predicate. []

We remarked earlier that informally a sip describes how bindings are passed
between predicates. An adorned rule formalizes this description. For example, the
above adorned rule says that to evaluate p with X bound and Y free, we evaluate b
with X bound to get values for Z. These values are then used to evaluate s with
Z bound and Y free.

The process of generating the adorned rule set starts with the query. The constants
in the query define adornments for the query predicates. For example, the query
query(X) :- ancestor("john", X) defines the adornment bf for ancestor.

For each adornment a of a derived predicate p, we create an adorned predicate p^a.
We then generate adorned rules defining p^a. The adorned rules are generated using the
original rules defining p. We use the following recursive procedure to generate the
adorned rules defining an adorned predicate p^a. For each rule r with head p,

(i) Generate a sip for this rule corresponding to the adornment a using the sip
generation algorithm described in section 4.7.2.1.

(ii) Generate a new rule with head p^a.

(iii) Replace each occurrence of a derived predicate d in the body by its adorned
version. We do this as follows. Let χ denote the union of the labels of all arcs
entering d in the sip. If there is no arc entering d, χ is set to empty. The adornment
for the predicate occurrence d is a_d, where a variable of d is bound in a_d if it
appears in χ. If χ is empty, the adornment a_d contains only f's.

(iv) Generate adorned rules defining d^(a_d) if they have not been generated before. []
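
Step (iii) and the adornment of the query can be sketched as follows. The function names and the convention that variables start with an upper-case letter are assumptions made for this illustration; only the rule that a variable is bound when it appears in the union of the entering labels comes from the procedure above.

```python
def is_variable(term):
    """Assumed convention: variables start with an upper-case letter."""
    return term[:1].isupper()


def query_adornment(args):
    """Constants in the query are bound (b); variables are free (f)."""
    return "".join("f" if is_variable(a) else "b" for a in args)


def adorn_body(body, sip_arcs, derived):
    """Adorn each derived predicate occurrence in a rule body.

    body     -- list of (occurrence name, argument tuple)
    sip_arcs -- list of (tail, target occurrence, label) arcs
    derived  -- names of derived predicates (base predicates stay unadorned)
    """
    adorned = {}
    for occurrence, args in body:
        if occurrence.split(".")[0] not in derived:
            continue
        chi = set()                      # union of labels of arcs entering d
        for _tail, target, label in sip_arcs:
            if target == occurrence:
                chi |= label
        adorned[occurrence] = "".join("b" if a in chi else "f" for a in args)
    return adorned


# R2 of the sg example with the two sips from section 4.7.1.
BODY = [("up", ("X", "Z1")), ("sg.1", ("Z1", "Z2")), ("flat", ("Z2", "Z3")),
        ("sg.2", ("Z3", "Z4")), ("down", ("Z4", "Y"))]
SIP1 = [(frozenset({"sg", "up"}), "sg.1", {"Z1"})]
SIP2 = SIP1 + [(frozenset({"sg", "up", "sg.1", "flat"}), "sg.2", {"Z3"})]

print(query_adornment(("john", "X")))
print(adorn_body(BODY, SIP1, {"sg"}))
print(adorn_body(BODY, SIP2, {"sg"}))
```

Under sip1 no arc enters sg.2, so it receives the all-free adornment; under sip2 both occurrences are adorned bf.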

4.7.2.1. Sip Generation Algorithm

As  we  mentioned  before,  Beeri  and  Ramakrishnan  in  their  paper  only  describe 
what  a  sip  is,  but  do  not  say  how  the  sip  is  to  be  generated.  We  have  developed  a  sip 
generation  algorithm,  which  we  describe  in  this  section. 

1. Initialize the set of sip arcs to empty.

2. Generate the rule-pred graph. The nodes of this graph are predicates. The node
corresponding to the head predicate is called head. An edge between two nodes
means that the corresponding predicates have variables in common. The edges are
labeled to denote the common variables.

3. For each derived predicate node p, find all simple paths from head to p. A simple
path between two nodes p and q is a path p, p1, ..., pn, q, where each node
appears only once. Each such path is a potential arc in the sip. For example, the
path (head, p1, p2, ..., pn, p) corresponds to the arc {head, p1, p2, ..., pn} → p.
The label χ for this arc is obtained as follows. For each node q in the tail of the
arc, if there is an edge between q and p in the rule-pred graph, we add the
variables denoted by this edge to χ. This step enumerates for each derived predicate p
all the possible arcs entering p, that is, all arcs with p as their head.

4. For each derived predicate, order the potential arcs in descending order of the
number of predicates in the tail. Thus, {head, p1, p2} → p comes before
{head, q} → p.

5. Order the derived predicates in descending order of the sum of the number of
predicates in the tail. Thus, if p has arcs {head, p1, p2} → p and {head, q} → p
and if s has one arc {head, p1, p2, p3, p4, p5, p6} → s, then s comes before p.

6. For each derived predicate p (as per the order of step 5), add arcs
{head, p1, p2, ..., pn} → p (as per the order of step 4) if (1) the edge between head
and p contains a variable that is bound in the adornment a, (2) adding the arc
does not cause a cycle in the sip, and (3) adding the arc causes a new variable of p
to appear in the label χ. []

The process of determining the set of arcs for a sip is exponential in the number of
derived predicates in the body of the rule. Steps 4 and 5 are heuristics that tend to
favor arcs with more predicates in their tail. As we shall see when we describe the
generalized magic sets algorithm, this will keep the size of the potential set of relevant facts
closer to the actual set of relevant facts.

Another feature of this algorithm is that each arc in the sip will have head in its
tail. This feature makes possible an important optimization in the generalized magic
sets algorithm. See [Beer86] for details.
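
Steps 2 to 4 of the algorithm can be sketched as below. This is a partial illustration, not our implementation: the function names are hypothetical, the data is R2 of the sg example, and the filtering of step 6 (bound variables, cycle test, new label variables) is deliberately left out.

```python
from itertools import combinations


def rule_pred_graph(preds):
    """Step 2: edge between two predicates that share variables, labeled with them."""
    edges = {}
    for p, q in combinations(preds, 2):
        shared = preds[p] & preds[q]
        if shared:
            edges[frozenset((p, q))] = shared
    return edges


def simple_paths(edges, node, goal, visited=()):
    """Yield every simple path (each node at most once) from node to goal."""
    visited = visited + (node,)
    if node == goal:
        yield list(visited)
        return
    for edge in edges:
        if node in edge:
            (nxt,) = edge - {node}
            if nxt not in visited:
                yield from simple_paths(edges, nxt, goal, visited)


def candidate_arcs(preds, target):
    """Steps 3 and 4: potential arcs entering target, largest tail first."""
    edges = rule_pred_graph(preds)
    arcs = []
    for path in simple_paths(edges, "head", target):
        tail = path[:-1]
        label = set()
        for q in tail:                       # variables shared with the target
            label |= edges.get(frozenset((q, target)), set())
        arcs.append((frozenset(tail), target, label))
    arcs.sort(key=lambda arc: len(arc[0]), reverse=True)
    return arcs


# R2 of the sg example; "head" stands for the head predicate sg(X, Y).
PREDS = {
    "head": {"X", "Y"}, "up": {"X", "Z1"}, "sg.1": {"Z1", "Z2"},
    "flat": {"Z2", "Z3"}, "sg.2": {"Z3", "Z4"}, "down": {"Z4", "Y"},
}

for arc in candidate_arcs(PREDS, "sg.2"):
    print(arc)
```

For sg.2 the first candidate is the arc with tail {head, up, sg.1, flat} and label Z3, which is the second arc of sip2. For sg.1 the enumeration also produces a candidate that starts from the free variable Y; rejecting such candidates is the job of step 6.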

Example:

R1: ancestor(X, Y) :- parent(X, Y).

R2: ancestor(X, Y) :- parent(X, Z), ancestor.1(Z, Y).

query(X) :- ancestor("john", X).

The query gives the adornment bf for ancestor. The sip for R2 for this adornment is
{ancestor, parent} → ancestor.1, with label χ = Z. The adorned rules are

ancestor^bf(X, Y) :- parent(X, Y).
ancestor^bf(X, Y) :- parent(X, Z), ancestor.1^bf(Z, Y).
query^f(X) :- ancestor^bf("john", X). []

Example:

sg(X, Y) :- flat(X, Y).

sg(X, Y) :- up(X, Z1), sg.1(Z1, Z2), flat(Z2, Z3), sg.2(Z3, Z4), down(Z4, Y).

query(X) :- sg("john", X).

Using the sip

{sg, up} → sg.1,  χ = Z1

{sg, up, sg.1, flat} → sg.2,  χ = Z3

we get the following adorned rules

sg^bf(X, Y) :- flat(X, Y).

sg^bf(X, Y) :- up(X, Z1), sg.1^bf(Z1, Z2), flat(Z2, Z3), sg.2^bf(Z3, Z4), down(Z4, Y).

query^f(X) :- sg^bf("john", X). []

Beeri  and  Ramakrishnan  give  a  formal  proof  that  for  a  given  query,  the  adorned 
rule  set  generated  using  the  above  algorithm  is  equivalent  to  the  original  rule  set. 

4.7.3.  Generalized  Magic  Sets 

The second (and final) step in the rule rewriting transformation is to define additional
predicates that compute the values that are passed between predicates according
to the chosen sip. Each of the original rules is modified by including these additional
predicates  in  the  rule  body.  This  ensures  that  a  rule  is  evaluated  only  when  the  values 
for  these  additional  predicates  are  available,  thereby  restricting  the  search  space.  The 
additional  predicates  are  called  magic  predicates  and  the  values  they  compute  are  called 
magic  sets.  The  rules  defining  magic  predicates  are  called  magic  rules.  The  original 
rules  modified  to  include  magic  predicates  in  their  body  are  called  modified  rules. 

We  describe  the  transformation  below. 

(i) For each adorned predicate p^a, we create a new predicate magic_p^a. The arity of
magic_p^a is equal to the number of b's in the adornment a. The arguments of
magic_p^a correspond to the bound arguments in the adornment.

(ii) For each adorned rule r and each occurrence of an adorned predicate p^a in its
body, we generate a magic rule defining magic_p^a. Let χ denote an argument list.
Then χ^f (respectively, χ^b) denotes χ with all arguments that are bound
(respectively, free) in the adornment a deleted. Let the adorned rule r be defined
as follows

r: p^a(χ) :- q_1^(a_1)(χ_1), q_2^(a_2)(χ_2), ..., q_n^(a_n)(χ_n).

Here a, a_1, ..., a_n denote adornments. The q_i's can be either base or derived
predicates. If q_i is a base predicate, its adornment a_i will be understood to be
empty.

Let s_r be the chosen sip for this rule. The predicates q_i are assumed to be ordered
according to the total order imposed by s_r. That is, predicates participating in the
sip precede those that do not, and for each arc in s_r, the predicates in the tail
precede the predicate at the head.

Consider q_i. Either there is only one arc N → q_i in the sip s_r, or there are many.
In the former case, we generate the magic rule as follows. The head of this rule is
magic_q_i^(a_i)(χ_i^b). For each q_j, j < i, add q_j^(a_j)(χ_j) to the body of the magic rule.
Finally, add magic_p^a(χ^b) to the body.

If there are several arcs entering q_i, we proceed as follows. For each arc N_j → q_i
with label χ_j, we define a rule with head label_q_i(χ_j). The body of this rule is
generated as described in the previous paragraph. The magic rule is then defined as a
rule with magic_q_i^(a_i)(χ_i^b) as head and label_q_i(χ_j) for all j as the body.

(iii) Modify each adorned rule by adding a magic predicate to its body. The magic
predicate corresponds to the head predicate: for a rule with head p^a(χ), it is
magic_p^a(χ^b).

(iv) Add the fact magic_q^a(χ_q^b) to the set of magic rules, where the query is q with
adornment a and argument list χ_q. []

Example:

ancestor^bf(X, Y) :- parent(X, Y).

ancestor^bf(X, Y) :- parent(X, Z), ancestor^bf(Z, Y).

query^f(X) :- ancestor^bf("john", X).

The magic rules are

magic_ancestor^bf(Z) :- magic_ancestor^bf(X), parent(X, Z).
magic_ancestor^bf("john").

The modified rules are

ancestor^bf(X, Y) :- magic_ancestor^bf(X), parent(X, Y).
ancestor^bf(X, Y) :- magic_ancestor^bf(X), parent(X, Z), ancestor^bf(Z, Y). []


We  note  the  following  points  from  this  example. 

(1) For the parent relation mentioned in a previous example, evaluating the magic
rules yields the following tuples for magic_ancestor^bf

magic_ancestor^bf("john"), magic_ancestor^bf("jack"), magic_ancestor^bf("mary"),
magic_ancestor^bf("evan"), magic_ancestor^bf("ellen"), magic_ancestor^bf("brian"),
magic_ancestor^bf("ann")

These tuples constitute the magic set for magic_ancestor^bf.

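(2) The magic set contains the projection on the first column of the set of relevant
tuples of the parent relation. Values such as "evan" that match no parent tuple
contribute nothing to the join in point (3).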

(3) The join of magic_ancestor^bf and parent yields all the relevant tuples of parent
needed to solve the query.

(4) Adding magic_ancestor^bf to the body of the modified rules forces this join to be
evaluated. This also ensures that the modified rules will be evaluated only after
the magic set has been computed.

(5) The modified rules are evaluated using only the relevant tuples of the parent
relation, thereby reducing the search space. As a matter of fact, the modified rules can
be written as

ancestor^bf(X, Y) :- relevant_parent(X, Y).

ancestor^bf(X, Y) :- relevant_parent(X, Z), ancestor^bf(Z, Y).

relevant_parent(X, Y) :- magic_ancestor^bf(X), parent(X, Y).

Beeri  and  Ramakrishnan  give  a  formal  proof  that  the  set  of  magic  rules  and  modified 
rules  is  equivalent  to  the  set  of  adorned  rules. 
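
The five points above can be checked with a small bottom-up sketch of the magic and modified rules for the ancestor example. The helper fixpoint and all variable names are hypothetical, and the two parent facts for "john" are assumed from the ancestor tuples of the earlier example.

```python
# Parent facts of the running example (the "john" facts are an assumption).
PARENT = {
    ("john", "jack"), ("john", "mary"),
    ("jack", "evan"), ("jack", "ellen"),
    ("mary", "brian"), ("mary", "ann"),
    ("joe", "charles"), ("joe", "diana"),
    ("charles", "ben"), ("charles", "jan"),
}


def fixpoint(step, start=()):
    """Iterate a monotone step function until no new tuples appear."""
    current = set(start)
    while True:
        new = current | step(current)
        if new == current:
            return current
        current = new


# Magic rules: magic(Z) :- magic(X), parent(X, Z).   magic("john").
magic = fixpoint(lambda m: {z for (x, z) in PARENT if x in m}, {"john"})

# relevant_parent(X, Y) :- magic(X), parent(X, Y).
relevant_parent = {(x, y) for (x, y) in PARENT if x in magic}

# Modified rules, written over relevant_parent as in point (5).
ancestor = fixpoint(
    lambda anc: relevant_parent
    | {(x, y) for (x, z) in relevant_parent for (z2, y) in anc if z == z2}
)

answer = {y for (x, y) in ancestor if x == "john"}
print(sorted(magic))
print(sorted(answer))
```

The LFP of the modified rules touches six parent tuples instead of ten and derives ten ancestor tuples instead of sixteen, while the answer to the query is unchanged.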

4.8.  Conclusions 

This chapter described the important concepts pertaining to D/KB query processing.
In our work, the data/knowledge base is considered to be a set of Horn clauses and
schemas. Definitions pertaining to the structure and composition of such a D/KB were
given. Recursive query processing was seen to be a key concept differentiating D/KB
query processing from traditional database query processing. The concepts of recursive
and nonrecursive predicates, recursive and nonrecursive rules, reachability, mutual
recursion, and cliques were described. Two basic strategies for Horn clause query
evaluation, top-down and bottom-up, were then described. Top-down
strategies were seen to be more efficient but more complex and harder to implement.
Bottom-up strategies were seen to be simpler and easier to implement but to do a lot of
useless work. Bottom-up evaluation of nonrecursive predicates was shown to be
accomplished via a straightforward compilation to relational algebra, while that of
recursive predicates involved evaluating the LFP of a set of recursive equations. Two
basic strategies for bottom-up LFP computation, naive and semi-naive evaluation,
were then described. Naive evaluation was seen to be less efficient, as it recomputes
tuples computed during previous iterations. Semi-naive evaluation was seen to avoid
much redundant work by computing the differential of the right hand side of the
recursive equations. Finally, the concepts relating to D/KB query optimization were
described. Sideways information passing to restrict the search space to the relevant
base relation tuples and rewriting the rules in the D/KB to an equivalent form whose
LFP computation is more efficient were seen to be the basic ideas behind D/KB query
optimization strategies. A novel sideways information passing algorithm was described.
CHAPTER 5


Transitive  Closure  Algorithms 


One of the difficult problems in the design of a D/KBMS is how to evaluate recursive
queries efficiently. In general, the solution to a recursive query cannot be expressed
as a finite relational algebraic expression and therefore cannot be evaluated directly by
a conventional relational database system [Aho79].

Among the large family of recursive queries, transitive closure queries, whose
processing requires the computation of the transitive closure of a database relation,
form a very important class. They are important because (1) a
large number of recursive queries can be expressed using transitive closures
[Agra87, Rose86], (2) most application problems involving recursive queries that we
are currently aware of are actually transitive closure queries, and (3) efficient processing of
transitive closure queries will provide a sound base for solving more complicated recursive
queries. It is thus not surprising that much effort has recently been devoted to the efficient
computation of the transitive closure of database relations [Ioan86, Rose86, Vald86].
There is even a tendency to extend relational algebra to include the operation of
transitive closure in relational database management systems [Agra87].

This chapter describes our evaluation of algorithms for computing the transitive
closure of a database relation. The results of this evaluation appear in [Lu87]. Based
on this investigation, we concluded that it is possible to further optimize transitive
closure processing. This led us to develop new strategies for this problem, which we also
describe in this chapter.

The chapter is organized as follows. Section 5.1 presents definitions and background
relating to transitive closure. Section 5.2 presents four algorithms for computing
the transitive closure of a database relation: the Brute Force and Logarithmic
iterative algorithms [Vald86], Warshall's algorithm, and Warren's algorithm. Section 5.3
describes two implementations of the Logarithmic algorithm and one implementation of
Warren's algorithm, and analyzes their performance. Section 5.4 gives the results of our
performance comparison. Section 5.5 presents conclusions from the evaluation of these
algorithms. Section 5.6 presents two new transitive closure evaluation strategies.


5.1.  Definitions  and  Background 

If R0(a,b) is a database relation, its transitive closure R = R0+ is defined by

$$R = R_0^{+} = \bigcup_{i \ge 1} R^{i} \qquad (5.1)$$

where R^i denotes the ith power of R0: R^1 = R0 and R^n = R^(n-1) ∘ R0 for n > 1. The
composition operator ∘ on the two binary relations R and S is defined by

$$R \circ S = \{(x, y) \mid \exists z : (x, z) \in R \wedge (z, y) \in S\}$$

Using relational algebra, this composition can be expressed as

$$R \circ S = \pi_{R.a,\,S.b}(R \bowtie_{R.b = S.a} S)$$

Graphically, relation R0 can be represented as a directed graph G(V,E), where a node
a ∈ V represents a value in the domain of R0.A or R0.B, and a directed edge a → b in E
represents a tuple (a,b) in the relation R0. Then, a node pair (x,y) is in the transitive
closure of R0, R (or R0+), whenever there is a path of nonzero length from x to y.
The longest path length, that is, the largest number of edges comprising a path, is
sometimes referred to as the depth of the transitive closure. We will follow the same
convention in our discussion.

More formally, the transitive closure of relation R0 represents the derived relation
R defined by the following Horn clauses:

R(x,y) :- R0(x,y).

R(x,y) :- R(x,z), R0(z,y).

The  transitive  closure  can  be  used  to  evaluate  more  complex  recursive  queries. 
Consider  for  example  the  EMP_SAL  and  EMP_MGR  relations  shown  in  figure  5.1. 
Note  that  not  all  employees  have  managers.  Consider  the  query,  "For  each  manager, 
list  the  names  of  his  subordinates  (direct  or  indirect)  and  their  total  salary."  This  query 
can  be  expressed  in  SQL  (augmented  with  the  transitive  closure  function)  as  shown  in 
figure  5.2. 
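
The same query can be sketched outside SQL to show what the transitive closure contributes. The code below is an illustration only: the data is transcribed from figure 5.1 with salaries in thousands, the initials printed as "X." in the scanned figure are assumed to be "N.", and the names transitive_closure and report are hypothetical.

```python
# EMP_SAL and EMP_MGR from figure 5.1 (salaries in thousands).
EMP_SAL = {
    "R. Smith": 30, "A. Bailey": 40, "B. Sullivan": 40, "N. Johnson": 45,
    "R. Elliott": 35, "K. Doty": 40, "C. Shaffer": 45, "T. Benton": 50,
    "J. Kennedy": 43, "N. Sibell": 45,
}
EMP_MGR = {
    ("R. Smith", "B. Sullivan"), ("A. Bailey", "N. Johnson"),
    ("B. Sullivan", "N. Johnson"), ("N. Johnson", "N. Sibell"),
    ("R. Elliott", "B. Sullivan"), ("K. Doty", "N. Sibell"),
    ("C. Shaffer", "T. Benton"), ("J. Kennedy", "N. Sibell"),
    ("N. Sibell", "T. Benton"),
}


def transitive_closure(pairs):
    """All (employee, direct or indirect manager) pairs."""
    closure = set(pairs)
    delta = set(pairs)
    while delta:
        delta = {(a, d) for (a, b) in delta
                 for (c, d) in pairs if b == c} - closure
        closure |= delta
    return closure


# Relation T of figure 5.2, then the join with EMP_SAL and the grouping.
t = transitive_closure(EMP_MGR)
report = {}
for ename, mname in sorted(t):
    entry = report.setdefault(mname, {"subordinates": [], "total": 0})
    entry["subordinates"].append(ename)
    entry["total"] += EMP_SAL[ename]

for mname, entry in sorted(report.items()):
    print(mname, entry["total"], entry["subordinates"])
```

Employees without subordinates do not appear in the report, and T. Benton, who has no manager, appears only as a manager.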

Equation  (5.1)  is  the  basis  for  several  iterative  transitive  closure  algorithms 
[Vald86].  These  algorithms  compute  successive  approximations  to  the  right  hand  side  of 
equation  (5.1)  until  convergence  is  obtained. 

Several years ago, Warshall described an essentially different algorithm for computing
the transitive closure of a relation [Wars62]. Warshall's algorithm was originally
designed to compute the transitive closure of a relation represented as an adjacency
matrix. An adjacency matrix is a two dimensional Boolean array M, where
M(x,y) = true whenever (x,y) ∈ R, otherwise M(x,y) = false. This algorithm has the
remarkable property that it can compute the transitive closure in a single pass over the
matrix, in contrast to the indefinite number required by iterative algorithms. Warren
subsequently modified this algorithm to give it better performance in a virtual memory
environment [Warr75]. We have developed an implementation of Warren's algorithm in
which the relation is represented as a set of tuples, as is usual in relational database
systems. In this form, the algorithm can be integrated into a relational database system
for the purpose of evaluating recursive queries.


EMP_SAL

Ename          Salary
R. Smith       30k
A. Bailey      40k
B. Sullivan    40k
N. Johnson     45k
R. Elliott     35k
K. Doty        40k
C. Shaffer     45k
T. Benton      50k
J. Kennedy     43k
N. Sibell      45k

EMP_MGR

Ename          Mname
R. Smith       B. Sullivan
A. Bailey      N. Johnson
B. Sullivan    N. Johnson
N. Johnson     N. Sibell
R. Elliott     B. Sullivan
K. Doty        N. Sibell
C. Shaffer     T. Benton
J. Kennedy     N. Sibell
N. Sibell      T. Benton

Figure 5.1. Relations for example recursive query.

5.2.  Algorithms  for  Transitive  Closure 

In  the  following  descriptions  of  transitive  closure  algorithms,  R  is  the  binary  source 
relation  with  attributes  A  and  B.  T  is  the  result  relation  and  has  the  same  attribute 
names. 

5.2.1.  Brute  Force  Iterative  Algorithm 

This version of the Brute Force algorithm works directly on the base relations;
Valduriez and Boral also present a version of the algorithm that uses a join index to
improve processing speed [Vald86]. We have modified the algorithm to work properly
with cyclic as well as acyclic relations, for a fair comparison with Warshall's and
Warren's algorithms. The Brute Force algorithm can be expressed as follows:

T := R;
R_Δ := R;
while R_Δ ≠ ∅ do
begin
    R_Δ := R_Δ ∘ R;
    R_Δ := R_Δ − T;
    T := R_Δ ∪ T;
end

INSERT INTO T(Ename, Mname)
SELECT *
FROM TRANSITIVE_CLOSURE(EMP_MGR);

INSERT INTO U(Mname, Ename, Salary)
SELECT T.Mname, T.Ename, EMP_SAL.Salary
FROM T, EMP_SAL
WHERE T.Ename = EMP_SAL.Ename;

SELECT Mname, Ename, SUM(Salary)
FROM U
GROUP BY Mname;

Figure 5.2. Implementation of recursive query.

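Note that the set union R_Δ ∪ T is a disjoint union. After i iterations of the while
loop,

$$R_\Delta = R^{i+1} - \bigcup_{1 \le j \le i} R^{j}$$

$$T = \bigcup_{1 \le j \le i+1} R^{j}$$

The algorithm terminates when

$$R^{i+1} \subseteq \bigcup_{1 \le j \le i} R^{j} \qquad (5.2)$$

From this it is easy to show that

$$T = \bigcup_{1 \le j \le i+1} R^{j} = R^{+}$$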

The number of iterations required can be expressed in terms of the directed graph
defined by R. Let paths(x,y) denote the set of paths from x to y in the graph. Let
length(s) denote the length of path s, i.e., the number of edges. Define the quantity p

$$p = \max_{x,y:\ \mathrm{paths}(x,y) \ne \emptyset}\ \min_{s \in \mathrm{paths}(x,y)} \mathrm{length}(s)$$

It is easy to show that (5.2) holds when i ≥ p, so the Brute Force algorithm requires p
iterations to compute the transitive closure of R.
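
The Brute Force iteration and its iteration count can be sketched over sets of tuples. This is a minimal illustration, assuming an in-memory relation; the names compose and brute_force_closure are hypothetical and no join index is used.

```python
def compose(r, s):
    """Composition of two binary relations: join on r.B = s.A, project (r.A, s.B)."""
    successors = {}
    for c, d in s:
        successors.setdefault(c, set()).add(d)
    return {(a, d) for (a, b) in r for d in successors.get(b, ())}


def brute_force_closure(r):
    """Return (transitive closure of r, number of iterations of the while loop)."""
    t = set(r)
    delta = set(r)
    iterations = 0
    while delta:
        delta = compose(delta, r) - t    # new tuples only, so the union is disjoint
        t |= delta
        iterations += 1
    return t, iterations


chain = {(1, 2), (2, 3), (3, 4), (4, 5)}     # acyclic, p = 4
cycle = {(1, 2), (2, 3), (3, 1)}             # cyclic, p = 3
print(brute_force_closure(chain))
print(brute_force_closure(cycle))
```

Subtracting T before the union is what makes the loop terminate on cyclic relations: once a pass produces no new tuple, R_Δ is empty.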

5.2.2.  Logarithmic  Algorithm 

The  following  version  of  Valduriez  and  Boral’s  Logarithmic  algorithm  works  with 
cyclic  relations: 

T := R;
R_Δ := R;
X := R;
while X ≠ ∅ do
begin
    R_Δ := R_Δ ∘ R_Δ;
    T_Δ := T ∘ R_Δ;
    X := R_Δ − T;
    Y := R_Δ ∪ T_Δ;
    T := Y ∪ T;
end

Again, the set union R_Δ ∪ T is a disjoint union. After i iterations,

$$T = \bigcup_{1 \le j \le 2^{i+1}-1} R^{j}$$

$$R_\Delta = R^{2^{i}}$$

$$X = R^{2^{i}} - \bigcup_{1 \le j \le 2^{i}-1} R^{j}$$

The Logarithmic algorithm terminates when

$$R^{2^{i}} \subseteq \bigcup_{1 \le j \le 2^{i}-1} R^{j}$$

This occurs when 2^i − 1 ≥ p, i.e., when i ≥ lg(p + 1). Therefore the Logarithmic algorithm
requires ⌈lg(p + 1)⌉ iterations. Valduriez found that the Logarithmic algorithm
generally performed better than the Brute Force algorithm; we will therefore use it as our
iterative algorithm in what follows.
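
A sketch of the Logarithmic iteration over sets of tuples follows, under the same assumptions as for the Brute Force sketch: an in-memory relation and hypothetical function names. It returns the iteration count so that the bound of ⌈lg(p + 1)⌉ can be observed.

```python
def compose(r, s):
    """Composition of two binary relations: join on r.B = s.A, project (r.A, s.B)."""
    successors = {}
    for c, d in s:
        successors.setdefault(c, set()).add(d)
    return {(a, d) for (a, b) in r for d in successors.get(b, ())}


def logarithmic_closure(r):
    """Return (transitive closure of r, number of iterations of the while loop)."""
    t = set(r)
    delta = set(r)          # R^(2^i) after i iterations
    x = set(r)
    iterations = 0
    while x:
        delta = compose(delta, delta)    # square the current power of R
        t_delta = compose(t, delta)      # extend every known path by that power
        x = delta - t                    # empty once the power adds nothing new
        t |= delta | t_delta
        iterations += 1
    return t, iterations


chain = {(1, 2), (2, 3), (3, 4), (4, 5)}     # p = 4, ceil(lg 5) = 3
cycle = {(1, 2), (2, 3), (3, 1)}             # p = 3, ceil(lg 4) = 2
print(logarithmic_closure(chain))
print(logarithmic_closure(cycle))
```

Each iteration performs two compositions instead of one, but the number of iterations grows with lg p rather than with p.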


5.2.3.  Smart  algorithms 

Ioannidis recently proposed a new set of algorithms, smart algorithms, to compute
the transitive closure of a relation [Ioan86]. A framework for optimizing the computation
along the same lines as the Logarithmic algorithm was provided. According to
the smart algorithms, the transitive closure of relation R0 is expressed as

$$R^{+} = \prod_{k=0}^{\infty} \left( \sum_{l=0}^{m-1} R_0^{\,l \cdot m^{k}} \right)$$

With a different m value, different algorithms can be obtained. The Logarithmic
algorithm is actually the special case of m = 2:

$$R^{+} = (1 + R_0)(1 + R_0^{2})(1 + R_0^{4}) \cdots$$

5.2.4.  Warshall’s  Algorithm 

Warshall  proposed  a  quite  different  algorithm  for  computing  transitive  closure  on  a 
binary  relation  [Wars62].  In  Warshall's  algorithm,  a  binary  relation  R  is  represented  by 
a  boolean  adjacency  matrix  M.  With  this  representation,  Warshall’s  algorithm  com¬ 
putes  the  transitive  closure  of  the  relation  as  follows: 

for  j:  =  1  to  .V  do 

for  i :  —  1  to  .V  do 

if  then 

for  £:=I  to  A’  do  M(i ,k):=  \f(i ,/fc) 

This  algorithm  effectively  computes  the  transitive  closure  in  only  one  pass  over  M.  If 
the  matrix  is  stored  in  row  major  order,  and  if  each  row  is  represented  as  a  string  of 
bits,  then  the  inner  loop  of  this  algorithm  can  be  implemented  very  efficiently  using 
machine  instructions  that  compute  the  logical-or  of  two  words  or  bit  strings. 

Warshall's algorithm works by creating ever shorter paths between two nodes in
the directed graph represented by R. Suppose R contains a path from x to y. Before
iteration j = z, T contains a path x, w_1, ..., w_n, y such that w_i ≥ z for all i. If z is
in this path, then this iteration creates a similar path in T with z removed. When the
algorithm terminates, (x,y) ∈ T. Warshall's paper gives a more formal correctness proof.
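The remark about bit strings can be made concrete with a short sketch. It assumes each row of M is held in one Python integer used as a bit string (bit k of `rows[i]` stands for M(i,k), with 0-based indices); the helper names are hypothetical and are not taken from Warshall's paper.

```python
def warshall_closure(rows):
    """rows[i] is an int whose bit k is set iff M(i, k) is true (0-based)."""
    rows = list(rows)
    n = len(rows)
    for j in range(n):
        bit_j = 1 << j
        for i in range(n):
            if rows[i] & bit_j:
                rows[i] |= rows[j]   # the whole inner k-loop as one logical-or
    return rows


def to_rows(n, edges):
    """Build the bit-string rows of an n x n adjacency matrix."""
    rows = [0] * n
    for a, b in edges:
        rows[a] |= 1 << b
    return rows


def to_pairs(rows):
    """List the (i, k) pairs whose matrix entry is true."""
    return {(i, k) for i, row in enumerate(rows)
            for k in range(len(rows)) if (row >> k) & 1}


closure_rows = warshall_closure(to_rows(4, [(0, 1), (1, 2), (2, 3)]))
closure_pairs = to_pairs(closure_rows)
```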

5.2.5.  Warren’s  Algorithm 

Warren proposed an improvement to Warshall's algorithm for a paging environment
in which the entire matrix will not fit in memory [Warr75]. Since Warshall's algorithm scans
the matrix by columns and updates it by rows, it will introduce a large number of page
faults in a virtual memory environment when the matrix cannot fit in real memory.
Warren's algorithm avoids this problem by scanning and updating the matrix by rows.
It has two passes instead of Warshall's one pass. However, each pass is over only half of
M. Here is Warren's algorithm:

    for i := 2 to N do
        for j := 1 to i - 1 do
            if M(i,j) then
                for k := 1 to N do M(i,k) := M(i,k) ∨ M(j,k)

    for i := 1 to N - 1 do
        for j := i + 1 to N do
            if M(i,j) then
                for k := 1 to N do M(i,k) := M(i,k) ∨ M(j,k)

A  formal  correctness  proof  of  the  algorithm  is  given  in  the  original  paper  [Warr75]. 
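As an illustration of the two row-wise passes, here is a minimal sketch on an in-memory matrix. It assumes rows are held as Python integers used as bit strings with 0-based indices; the function name `warren_closure` is hypothetical, and paging behaviour is not modelled.

```python
def warren_closure(rows):
    """Two-pass closure; rows[i] has bit k set iff M(i, k) is true."""
    rows = list(rows)
    n = len(rows)
    # Pass 1: entries below the diagonal (j < i), row by row.
    for i in range(1, n):
        for j in range(i):
            if (rows[i] >> j) & 1:
                rows[i] |= rows[j]
    # Pass 2: entries above the diagonal (j > i), row by row.
    for i in range(n - 1):
        for j in range(i + 1, n):
            if (rows[i] >> j) & 1:
                rows[i] |= rows[j]
    return rows


cycle = [0b010, 0b100, 0b001]          # edges 0 -> 1, 1 -> 2, 2 -> 0
cycle_closure = warren_closure(cycle)
```

Both passes read and update only the current row i plus the rows it points to, which is the access pattern the paragraph above contrasts with Warshall's column scan.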

Warren's algorithm can be represented in relational database terms. The domain
of R assumed in the original implementation is the range of integers [1, N]. In fact, any
finite, totally ordered domain D can be used and the choice of total order >_D is
arbitrary. For the large domains commonly occurring in database relations, the adjacency
matrix representation of R is impractical; the set of tuples representation is much more
compact. To achieve the effect of scanning the adjacency matrix by rows, we maintain
the tuples in a sequence sorted by attributes A and B as primary and secondary key,
respectively. This gives the following implementation of Warren's algorithm:

    T := R sorted by attributes <A, B>

    for t ∈ T { in sorted order } do
        if t.A >_D t.B then
            insert {t} ∘ σ_{A=t.B}(T) into T;

    for t ∈ T { in sorted order } do
        if t.A <_D t.B then
            insert {t} ∘ σ_{A=t.B}(T) into T;


Tuples inserted into T must be inserted in sorted order. The <_D and >_D comparison
operators refer to the total ordering of the domain D. The expression {t} ∘ σ_{A=t.B}(T) is
the composition of the singleton relation consisting of tuple t with the tuples of T whose
first attribute matches the second attribute of t. All of the tuples in the result of this
composition have t.A as their first attribute value. Inserting these tuples into T, which
is clustered on attribute A, is fairly cheap. Performing the selection is also
fairly cheap for the same reason.
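The tuple-based formulation can be sketched in a few lines. This is an illustrative in-memory version under stated assumptions, not the disk-based implementation costed later: T is held as a mapping from each A value to the set of its B values (standing in for clustering on A), Python's built-in ordering plays the role of >_D, and the names `warren_relational` and `run_pass` are hypothetical.

```python
def warren_relational(r):
    """Transitive closure of a set of (a, b) tuples in two sorted passes."""
    t = {}
    for a, b in r:
        t.setdefault(a, set()).add(b)

    def run_pass(selects):
        for a in sorted(t):                      # scan T in order of A
            last = None
            while True:
                # next unvisited tuple (a, b) of this cluster, in order of B;
                # tuples inserted behind the scan position are not revisited
                pending = [b for b in t[a]
                           if selects(a, b) and (last is None or b > last)]
                if not pending:
                    break
                last = min(pending)
                # {t} composed with the tuples of T whose A equals t.B
                t[a] |= t.get(last, set())

    run_pass(lambda a, b: a > b)    # first pass:  t.A >_D t.B
    run_pass(lambda a, b: a < b)    # second pass: t.A <_D t.B
    return {(a, b) for a, bs in t.items() for b in bs}


closure = warren_relational({(1, 2), (2, 3), (3, 1)})
```

Rescanning the cluster to find the next tuple keeps the sketch short; a real implementation would instead keep each cluster physically sorted, as the text describes.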

We investigated a similar implementation of Warshall's algorithm. However, the
relation T could not be clustered in such a way that both the selection and the insertion
were cheap. This difficulty reflects the fact that the original version of Warshall's
algorithm scans the matrix by columns (clustered on B) and updates the matrix by rows
(clustered on A). Therefore we chose to evaluate the performance of Warren's algorithm only.

5.3. Implementation of Transitive Closure Algorithms and Their Costs

In this section we discuss the implementation details of the transitive closure
algorithms and derive their cost formulas.


5.3.1.  Basic  Operations  and  Their  Costs 

The  basic  operations  involved  in  the  iterative  algorithm  are  binary  relation  compo¬ 
sition,  set  union  and  set  difference.  Composition  is  a  join  followed  by  a  project,  so  we 
now  describe  the  implementation  of  join,  union,  and  difference,  and  derive  their  costs. 


5.3.1.1. Join

For the join operation, we have chosen the hybrid hash join algorithm [DeWi84]
because of its superior performance. Hybrid hash join consists of two phases, partitioning
and probing. In the partitioning phase, each of the two relations, R and S, is
partitioned into a number of disjoint buckets. The bucket sizes of the smaller relation, R,
are selected such that a hash table can be constructed for a bucket in memory. After
this table is constructed, the tuples in the corresponding bucket of relation S are used
to probe the hash table to find matches. If the hash table for the whole
relation R cannot fit into memory, more than one iteration is needed to process
the buckets. During the partitioning of R, the tuples of one bucket remain in memory
to construct the hash table; tuples of other buckets are written back to the disk.
Similarly, when relation S is partitioned, the tuples in the first bucket are directly used to
probe the hash table in memory and the others are written back to the disk. The buckets
written back to the disk are read in again to construct and probe the hash table in
later iterations. The partitioning phase requires extra buffers to hold tuples being
collected for the various buckets. However, we assume that M bytes of memory are
available for each hash table, whether or not partitioning is occurring while the hash table is
being built. Using the notation in Table 5.1, the cost formulas of the hash-based join of
R and S are given in Table 5.2.
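The partition-then-probe structure can be sketched as follows. This is a simplified illustration, not hybrid hash join itself: every bucket is an in-memory Python list, so the hybrid refinement of keeping the first bucket resident while the others go to disk is not modelled. The function name and the `n_buckets` parameter are hypothetical, and the join is specialised to the composition of two binary relations.

```python
def hash_join_compose(r, s, n_buckets=4):
    """R composed with S: join on R.B = S.A, projected onto (R.A, S.B)."""
    r_buckets = [[] for _ in range(n_buckets)]
    s_buckets = [[] for _ in range(n_buckets)]
    for a, b in r:                     # partition R on its join attribute B
        r_buckets[hash(b) % n_buckets].append((a, b))
    for b, c in s:                     # partition S on its join attribute A
        s_buckets[hash(b) % n_buckets].append((b, c))

    result = set()
    for r_bucket, s_bucket in zip(r_buckets, s_buckets):
        table = {}                     # in-memory hash table for one R bucket
        for a, b in r_bucket:
            table.setdefault(b, []).append(a)
        for b, c in s_bucket:          # probe with the matching S bucket
            for a in table.get(b, ()):
                result.add((a, c))     # the set also removes duplicates
    return result


edges = [(1, 2), (2, 3), (3, 4)]
paths_of_length_two = hash_join_compose(edges, edges)
```

Because both relations are partitioned with the same hash function on the join attribute, matching tuples can only meet inside the same bucket pair.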


5.3.1.2. Difference

The difference operation R − S can be implemented in a way similar to the hybrid
hash join. In this case, R and S are partitioned using a hash function on the entire
tuple. For each bucket of S, a hash table is constructed in main memory. This table is
probed with the tuples of the corresponding bucket of R, and those tuples for which
there is no match are added to the result. A quantity analogous to join selectivity, called
difference selectivity, DS, is defined as the ratio of the number of tuples in the result
relation of the difference operation to the number of tuples in the source relation R.
The cost of this algorithm is shown in Table 5.3.


5.3.1.3. Union

We can use a hash-based algorithm to perform the set union operation R ∪ S in a
similar way. In this algorithm, all the tuples of relation S and the tuples from relation
R that don't match with tuples in S are moved to output buffers to form the union
output. Table 5.4 shows the cost of this algorithm.

    Symbol     Meaning
    |R|        Number of pages in relation R
    ||R||      Number of tuples in relation R
    JS         Join selectivity = ||R join S|| / (||R|| · ||S||)
    US         Union selectivity = ||R ∪ S|| / (||R|| + ||S||)
    DS         Difference selectivity = ||R − S|| / ||R||
    DS_i       Selectivity of difference in the i-th iteration
    TS         Tuple length (in bytes)
    PS         Page size (in bytes)
    t_comp     Time for comparing two attribute values
    t_move     Time for moving a binary tuple in memory
    t_hash     Time for hashing an attribute
    t_read     Time for reading one page from disk
    t_write    Time for writing one page to disk

Table 5.1: Notations Used in Cost Formulas.

Table 5.2: Cost Join(R, S) of Hash-Based Join.

5.3.1.4. Combined Union and Difference

The union and difference operations both partition R and S by hashing complete
tuples, so we can combine them to compute R − S and R ∪ S simultaneously. This is
useful for both the Brute Force and the Logarithmic algorithm. The total cost of
obtaining these two results, denoted as Union_Diff(R, S), is the sum of the cost of
Diff(R, S) in Table 5.3 and the terms (9) and (10) in the cost table of Union(R, S), Table
5.4. It is shown in Table 5.5.
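A minimal sketch of the combined operation follows, under the assumptions that both relations are duplicate-free lists of tuples and that all buckets fit in memory. The name `union_diff` and the `n_buckets` parameter are hypothetical. The sketch also evaluates the selectivities DS and US from Table 5.1 on a small example.

```python
def union_diff(r, s, n_buckets=4):
    """Return (R - S, R union S) from one partitioning of both relations."""
    r_buckets = [[] for _ in range(n_buckets)]
    s_buckets = [[] for _ in range(n_buckets)]
    for t in r:
        r_buckets[hash(t) % n_buckets].append(t)   # hash on the entire tuple
    for t in s:
        s_buckets[hash(t) % n_buckets].append(t)

    difference, union = [], []
    for r_bucket, s_bucket in zip(r_buckets, s_buckets):
        table = set(s_bucket)          # hash table built on the S bucket
        union.extend(table)            # every tuple of S goes to the union
        for t in r_bucket:             # probe with the R bucket
            if t not in table:
                difference.append(t)   # unmatched tuples form R - S ...
                union.append(t)        # ... and are also added to the union
    return difference, union


r = [(1, 2), (2, 3), (3, 4)]
s = [(2, 3), (5, 6)]
diff, uni = union_diff(r, s)
ds = len(diff) / len(r)                # difference selectivity DS
us = len(uni) / (len(r) + len(s))      # union selectivity US
```

The probe step is shared: a tuple of R that finds no match is written to both outputs, so the only extra work over the difference alone is moving and writing the tuples of S.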

5.3.2. Cost of Iterative Algorithm

The iterative algorithm (the Logarithmic algorithm described above) requires
⌈lg(p + 1)⌉ iterations. In each iteration there are two joins (R^ ∘ R^ and T ∘ R^), one
union (T ∪ T^) and one combined union and difference (between R^ and T). The cost
C_i of iteration i is the sum of the costs of these four operations.


    (1)-(6)  =  the same as (1)-(6) in Table 5.2 of Join(R, S)
    (7)      +  …                       — Moving tuples of S to build hash tables for buckets of S
    (8)      +  ||R|| · …               — Probing for matching tuples in S
    (9)      +  ||R|| · DS · TS         — Moving tuples of R not in S to the output buffers
    (10)     +  (… / PS) · t_write      — Writing the output buffers to disk

Table 5.3: Cost Diff(R, S) of Hash-Based Set Difference.

Table 5.4: Cost Union(R, S) of Hash-Based Set Union.

    (1)-(10) =  the same as (1)-(10) of Diff(R, S)

Table 5.5: Cost Union_Diff(R, S) of Combined Hash-Based Union and Difference.
The total cost is

    C = \sum_{1 \le i \le \lceil \lg(p+1) \rceil} C_i

5.3.3.  Improved  Iterative  Algorithm 

The preceding cost computation assumed that each relation is written to disk as a
sequential file as it is generated by a join or other relational operation. If this file is
used as input to a subsequent operation, it must be partitioned into buckets again.
This partitioning involves reading the sequential file and writing all but one bucket
back to disk.

We  propose  here  an  improved  implementation  of  the  iterative  algorithm  in  which 
each  relation,  as  it  is  generated,  is  partitioned  in  preparation  for  the  next  use  of  the 
relation.  This  technique  eliminates  writing  and  reading  the  sequential  file.  However,  it 
requires  more  memory  for  buffers  for  output  relation  buckets.  As  before,  we  assume 
that  there  are  relatively  few  buckets  so  that  the  memory  available  for  a  hash  table  is 
not  significantly  reduced. 

In  the  improved  iterative  algorithm,  relations  T,T^  and  Y  are  partitioned  on  attri¬ 
bute  B  whenever  they  are  generated.  When  relation  R ^  is  generated,  two  partitionings 
are  generated:  one  on  attribute  A  and  one  on  attribute  B.  Relation  X  is  not  parti¬ 
tioned.  In  fact,  it  need  not  be  materialized  since  we  don’t  need  to  know  its  exact  value, 
only  whether  or  not  it  is  empty.  These  partitionings  ensure  that  each  relational  opera¬ 
tion  can  proceed  immediately  with  bucket-by-bucket  processing. 

The cost of the improved iterative algorithm is computed by modifying the cost for the
basic relational operations to reflect partitioning output relations instead of input
relations. The cost of partitioning T and T^ prior to the first iteration must also be
included.

5.3.4. Warren's Algorithm

The physical implementation of our version of Warren's algorithm is based on a
particular choice for the total order >_D on the domain D of relation R. Instead of
using the normal order relation > on numbers or character strings, we define >_D as
follows:

    x >_D y  ⇔  hash(x) > hash(y)  ∨  (hash(x) = hash(y)  ∧  x > y)

That is, two elements of the domain are ordered primarily on their hash values and
secondarily by the normal ordering for numeric or string attributes. This
choice of order allows us to use hashing techniques for sorting, selection and insertion.
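The order >_D can be expressed as a sort key. The sketch below is illustrative only: `bucket_hash` is a hypothetical, deterministic stand-in for the hash function (the report does not specify one), and `N_BUCKETS` is an arbitrary assumed value.

```python
import zlib

N_BUCKETS = 8   # hypothetical number of hash buckets


def bucket_hash(x):
    """Deterministic stand-in for hash(x); any fixed hash function would do."""
    return zlib.crc32(repr(x).encode("utf-8")) % N_BUCKETS


def order_key(x):
    """Sort key realising >_D: hash value first, normal order second."""
    return (bucket_hash(x), x)


def greater_d(x, y):
    """True iff x >_D y."""
    return order_key(x) > order_key(y)


values = [17, 3, 42, 8, 25, 11]
ordered = sorted(values, key=order_key)
```

Sorting by this key places all values with the same hash value next to each other, so partitioning into buckets by hash value and then sorting each bucket yields the same sequence as a full sort under >_D.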

Relation  T  is  physically  represented  as  a  hash  table  with  a  main  memory  pointer 
array  and  disk-resident  buckets.  The  number  of  buckets  is  such  that  one  page  of  each 
bucket  can  fit  into  main  memory. 

The first step in our version of Warren's algorithm is to create T by sorting
relation R on attributes A and B. Given the choice of total order on D, we can sort R by
partitioning it into buckets based on the hashed value of attribute A, and then sorting
each bucket. After partitioning, main memory is organized as a cache of recently
retrieved buckets.

Table 5.6 shows the cost of partitioning the relation and the cost of sorting the
buckets during processing. Here we use R_0, R_1 and R to represent the source relation,
the result relation after the first pass and the final result relation, respectively.


    (1)     |R_0| · t_read                                    — Reading the source binary relation
    (2)  +  ||R_0|| · t_hash                                  — Hashing the input relation into partitions
    (3)  +  |R_0| · (1 − M / …) · t_write                     — Writing overflow buckets to disk
    (4)  +  |R_0| · (PS/TS) · lg(PS/TS) · (t_comp + t_move)   — Internal sorting of all the pages

[...] order of hash value. First, [...] in the bucket such that [...] the in-memory hash
table [...] tuples are in the same or [...] necessary unless the earlier [...]

    …   — Reading the bucket on disk for processing
    …   — Comparing the two attribute values
    …   — Hashing and looking up the directory

Table 5.7: Cost of the First Pass of Warren's Algorithm.

Table 5.8 shows the cost of the second pass over the relation. The formulas are
similar to those in Table 5.7, with different relation sizes.

As mentioned above, Warren's algorithm completes the transitive closure computation
in two passes. The total cost of the algorithm can therefore be obtained as the sum

    C_0 + C_1 + C_2

5.4.  Evaluation 

We have chosen four parameters to study the effect of their variation on the
performance of the three transitive closure algorithms. These four parameters are 1) the
memory size, 2) the source and the result relation sizes, 3) the number of iterations in
the iterative algorithms, and 4) the join selectivity. We have calculated the total cost of
each of the three transitive closure algorithms for various sets of parameter values. The
values of the I/O and computation parameters have been fixed in all our experiments.
In each experiment, we have varied one parameter value over a range, while keeping the
values of the other parameters fixed. The value ranges and the typical
values we have used in our evaluation are listed in the table of parameter settings.

Table 5.8: Cost C_2 of the Second Pass of Warren's Algorithm.

Figure 5.3 illustrates the increase in the execution time of the three algorithms as
the total available memory size is reduced. Warren's algorithm, as expected from its
origin as a main memory algorithm, performs very poorly as the memory size is reduced.
The actual crossover value of the memory size at which Warren's algorithm starts
performing worse depends on the source and result relation sizes and the values of the
other system parameters. However, we can see a significantly better performance from
Warren's algorithm when reasonable memory sizes are assumed (e.g. > 2 MB). From figure 5.3,
we can observe that for memory sizes exceeding 4 megabytes (assuming source and
result relation sizes of the order of 6 and 8 megabytes) Warren's algorithm performs far
better than the iterative algorithms. Since memory sizes of a few megabytes are fairly
typical in current systems, we can expect Warren's algorithm to perform better than the
iterative algorithms in most typical applications. We can also observe from the figure
that the improved iterative algorithm performs much better than its basic counterpart.

Parameter settings used in evaluation

    Parameter                     Values (typical value)
    Source relation R_0           500KB - 10MB (2MB)
    Memory size M                 400KB - 8MB
    Join selectivity JS           10^-6 - 10^-8 (10^-7)
    Union selectivity, US         1.0 (no duplication)
    Difference selectivity, DS    1.0 (no duplication)
    Number of iterations, p       1 - 62 (6)
    Page size, PS                 4K Bytes
    Tuple size, TS                8 Bytes

Figure 5.4 illustrates the changes in the performance of the transitive closure
algorithms as the source relation size is varied, while keeping the memory size fixed. We
can observe from this figure a behavior similar to that seen in figure 5.3: as the
memory size available for holding the output of transitive closure becomes limited, the
performance of Warren's algorithm deteriorates rapidly.

Figure 5.5 shows the effect of variation in the number of iterations required to
compute the transitive closure. Since Warren's algorithm is not iterative, its performance
remains the same, while the performance of the iterative algorithms becomes much
worse than that of Warren's algorithm when the number of iterations required becomes
large. Therefore, if a large number of successors of tuples are involved in computing the
transitive closure, it is better to use Warren's algorithm.

Figures 5.6 and 5.7 illustrate the changes in the performance of the algorithms as
the join selectivity for each iteration is increased and decreased, respectively. The
performance behavior exhibited in both these figures is the same. In both figures, as the
result relation size increases, the iterative algorithms show better performance than
Warren's algorithm. We can conclude from these figures that the values of join
selectivity are not as important as the result relation sizes in affecting the relative
performance of the transitive closure algorithms.

Figure 5.5. Execution Time vs. Depth of Transitive Closure

5.5. Summary and Discussion

We have presented in this section an adaptation of Warren's algorithm to the
relational database environment. It is a non-iterative algorithm and computes the
transitive closure of a relation in a depth-first search fashion. This algorithm was compared
with a logarithmic iterative algorithm and an improved version of the logarithmic
algorithm. The motivation for our study of transitive closure algorithms and their
performance is to find alternative methods for recursive query processing. As we
expected, Warren's algorithm, which is basically a simple depth-first search, main
memory algorithm, works better in two cases: (1) the relation is not
much larger than the available memory, and (2) the path lengths in the
transitive closure graph vary greatly. In the second case, the iterative algorithms have to join
two whole relations (often very large) iteratively to find a small number of tuples, and
the total cost increases dramatically. Thus, our recommendation is to implement
Warren's algorithm for transitive closure in database systems and let the query
optimizer select it adaptively. In the remainder of this section, we briefly present some
observations for future work.

Auxiliary Data Structures. In this work, we have not assumed any auxiliary
storage structures such as clustered or non-clustered indices and join indices. All
operations are applied to the original data. Join indices have been shown to improve the
performance of join operations. They also improve the performance of iterative transitive
closure algorithms [Vald86]. The reason for this is that the size of a join index relation is in
general less than the binary relation size. Further investigation of the relative
performance improvement of Warren's algorithm resulting from the use of auxiliary data
structures is a worthwhile task.

Search Techniques. Depth-first and breadth-first algorithms have been explored
extensively to solve the general search and tree traversal problems. Since transitive
closure computation is basically a graph search problem, both depth-first and breadth-first
algorithms can be employed to compute the transitive closure. Warren's algorithm can
be viewed as a depth-first algorithm and iterative algorithms can be viewed as
breadth-first algorithms. This analogy can be useful for further research into the application of
combined breadth-first and depth-first transitive closure computation techniques, as has
been suggested in solving other graph search problems. One possible technique is to
apply an iterative algorithm for a few iterations first to find most of the tuples
in the transitive closure and then switch to Warren's algorithm to find the few
tuples which can be derived only through longer search paths.

Transitive  Closure  vs.  Fix  Point  Queries.  We  concentrated  only  on  the  tran¬ 
sitive  closure  algorithms.  The  algorithms  or  the  evaluation  results  presented  here  may 
not  be  applicable  to  performing  general  least  fixpoint  operations.  Further  research  into 
the  application  of  these  algorithms  to  least  fixpoint  query  execution  is  required. 

Restricted Transitive Closure. In general, complete transitive closure is
seldom required by applications: a subset of the transitive closure is adequate for
answering many queries. The algorithm to handle the restricted transitive closure queries is
dependent on the restriction criteria. However, general mechanisms for restricting the
output of each iteration of the transitive closure operation and terminating the transitive
closure computation after a specified number of iterations are possible. These
mechanisms might also be useful in executing general least fixpoint queries.

Multi-processor Transitive Closure Algorithms. Transitive closure is a
data-intensive operation. It is possible to partition the task of this very large database
processing onto multiple processors and improve the performance of transitive closure
computation significantly. For iterative algorithms in a multiprocessor environment, the
join and union operations in each iteration can be assigned to separate processors,
improving the performance through concurrent and pipelined processing. For executing
Warren's algorithm using multiple processors, the search of subgraphs starting from
different nodes in the graph can be assigned to different processors. Another potential
area for future work is to design, analyze and evaluate multiprocessor-based iterative
algorithms and Warren's algorithm.

5.6.  New  Strategies  for  Optimizing  Transitive  Closure  Evaluation 

In this section, we propose two new strategies that further optimize
the computation of the transitive closure of a database relation. As a point of
terminology, we refer to our adaptation of Warren's algorithm presented in the previous section
as the recursive algorithm.

We first assume that the relation we are dealing with is so large that it is
impossible to hold all its tuples in main memory. In this case the computation of transitive
closure, no matter which algorithm is used, requires a large number of join, union and
set difference operations on very large relations. Partitioning a very large relation into
smaller disjoint partitions has proved to be a reasonable way to dramatically reduce the
cost of join operations on large relations [DeWi84]. Both the analysis of the recursive
algorithm [Lu87] and that of the logarithmic algorithm [Vald86] are based on the hash join
method. We assume that the same technique is used in our discussion.

5.6.1. Strategy 1: Reduce the Size of R_0

Compared to the naive algorithm, the semi-naive algorithm focuses on eliminating
the duplication of computation by using only the newly generated tuples as one of the
source relations of the join in the next iteration. However, none of the previous
algorithms tried to reduce the size of the other source relation in the join operation, relation
R_0. Since relation R_0 is used in each iteration, its size may have even more influence on
the performance of the transitive closure algorithms.
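For reference, the semi-naive iteration described above can be sketched on in-memory sets. The function name and the returned iteration count are hypothetical additions for illustration. Note that R_0 (indexed here as `by_first`) takes part in every join at full size, which is the cost this strategy targets.

```python
def semi_naive_closure(r0):
    """Semi-naive transitive closure of a set of (a, b) tuples."""
    by_first = {}                     # R_0 indexed on its first attribute
    for a, b in r0:
        by_first.setdefault(a, set()).add(b)

    closure = set(r0)
    delta = set(r0)                   # tuples generated in the last iteration
    iterations = 0
    while delta:
        # join only the newly generated tuples with the full R_0
        joined = {(a, c) for a, b in delta for c in by_first.get(b, ())}
        delta = joined - closure      # keep only tuples not seen before
        closure |= delta
        iterations += 1
    return closure, iterations


chain = {(1, 2), (2, 3), (3, 4), (4, 5)}
closure, iterations = semi_naive_closure(chain)
```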

Our first optimization strategy is to eliminate dynamically those tuples from
relation R_0 that will not generate tuples in the result relation in later iterations. The
example in figure 5.8 is used to explain the strategy.

R_0 consists of 13 tuples. For the semi-naive algorithm, the first iteration joins R_0
with R_0 and generates ΔR_1 = R_0 ∘ R_0, which consists of 12 tuples. Traditionally, the
second iteration will join ΔR_1 with R_0 again to generate ΔR_2 = ΔR_1 ∘ R_0. However, if
we examine the join process, we can find that some tuples in R_0 will never introduce
new tuples. These tuples, in the column of R_0 above the dotted line, can actually be
removed from R_0 without affecting the final result. A new relation R_0^1 formed in this
way can be used in the second iteration to compute ΔR_2. In this example, R_0^1 consists
of only 6 tuples, less than 50 percent of R_0.

Figure 5.8: An Example of Computation of R_0^1.

Figure 5.9 lists algorithm REDUCE, an algorithmic description of the suggested
strategy for reducing the size of relation R_0. The notation used is similar to that used
in the semi-naive algorithm: the two relations to be joined in iteration i are ΔR_i and R_0^i.
ΔR_i contains new tuples in the transitive closure generated in the (i−1)th iteration.
Relation R_0^0 = R_0, and R_0^i is reduced to R_0^{i+1}, which is to be used in the next
iteration to join with ΔR_{i+1}. Note that algorithm REDUCE as described above is for
general cases. For a particular algorithm, for example, the semi-naive algorithm, the
removal of tuples from ΔR_i is only needed for the first iteration, the join of R_0 and R_0:
for the semi-naive algorithm, ΔR_i only contains newly generated tuples which are not in R_0.

Graphically, removing tuples as described in the algorithm is the process of
removing outgoing edges from nodes satisfying the following conditions: (i) there is no
incoming edge to the node, and (ii) all outgoing edges have already been inserted into the
result relation. The second condition is automatically satisfied because the original
relation R_0 is copied into the result. Since there is no incoming edge to the node, no more
paths can be generated via the node, and its removal from the graph will not lose results.
In the above example, node 1 has no incoming edges; after the edges starting from it are
inserted into the result relation, it can be removed along with those edges. This removal
of node 1 further causes the removal of nodes 2 and 3, since the only incoming edges for
nodes 2 and 3 are from node 1.

    Algorithm REDUCE:

    Input  : Two intermediate relations ΔR_i and R_0^i
    Output : Relation R_0^{i+1}

    begin
        repeat
            foreach tuple t ∈ R_0^i do
            begin
                if ΔR_i ∘ {t} = ∅
                then begin
                    remove t from R_0^i;
                    if t ∈ ΔR_i
                        then remove t from ΔR_i;
                end;
            end;
        until no tuple can be removed from R_0^i
        R_0^{i+1} := R_0^i;
    end;

Figure 5.9: Algorithm Reducing the Size of R_0.
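The steps of Figure 5.9 can be transcribed directly onto in-memory sets. This is a literal sketch of the figure only, under the assumption that relations are Python sets of (a, b) tuples. The function name `reduce_r0` and the small example graph are hypothetical (the graph is not the one in figure 5.8), and the efficient hash-join-based implementation mentioned below is not attempted.

```python
def reduce_r0(delta_r, r0):
    """One call of REDUCE: returns the reduced (delta_r, r0) pair."""
    delta_r, r0 = set(delta_r), set(r0)
    removed = True
    while removed:                               # repeat ... until
        removed = False
        for t in sorted(r0):                     # foreach tuple t in R_0^i
            # delta_r composed with {t} is empty iff no tuple of delta_r
            # has t.A as its second attribute
            if not any(b == t[0] for _, b in delta_r):
                r0.discard(t)
                delta_r.discard(t)               # no effect unless t was in delta_r
                removed = True
    return delta_r, r0


# Hypothetical graph: node 1 has no incoming edge; nodes 2 and 3 are
# reachable only from node 1; nodes 4 and 5 form a cycle.
edges = {(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (5, 4)}
reduced_delta, reduced_r0 = reduce_r0(edges, edges)
```

On this example the removal of the edges leaving node 1 cascades to the edges leaving nodes 2 and 3, mirroring the behaviour described for nodes 1, 2 and 3 in the text.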

For large database relations, it will be very expensive if algorithm REDUCE is
implemented as it is described in figure 5.9. In the next section, one possible
implementation is described which modifies the hash join method to dynamically reduce the size
of R_0 without heavy overhead. Another point we would like to make is that this
strategy has some flavor of using join indices to compute the transitive closure [Vald86]:
only those tuples which are joinable are kept for computation. However, join indices
are static data structures and do not change for different iterations of the computation.
In our algorithm, size reduction is performed dynamically. We have the benefit of
reducing the data size without the disadvantages associated with join indices: the cost
of generating the join indices and maintaining them in a database; the difficulty of
determining for which relations and on which attributes the join indices should be
maintained; and the complexity of determining whether it is beneficial to use the join indices.

5.6.2.  Strategy  2:  Speed  Up  the  Convergence 

The number of iterations needed to complete the transitive closure computation is
another source of optimization. The logarithmic algorithm and smart algorithms
outperform the semi-naive algorithm since they generate more tuples in one iteration and
fewer iterations are needed. Intuitively, the source relations are read from the
disks only once in each iteration. The more tuples generated in one iteration, the fewer
iterations are needed to complete the computation. Thus, one of the major
processing costs, disk I/O for reading in the source relation, is reduced. The CPU cost,
such as rehashing, if hash join is used, is also partly reduced. These savings give the
logarithmic and smart algorithms better performance [Vald86, Ioan86].

The recursive algorithm is an extreme along this direction: when a tuple is processed,
all tuples derivable from this tuple are generated. The performance of the algorithm
is independent of the maximum path length of the transitive closure of the relation.
If there are some very long paths in the transitive closure, this algorithm will
outperform the iterative algorithms. The limitation of this algorithm is that, in order
to find all tuples derivable from a tuple, the processing has the flavor of a depth-first
search. In cases where the size of memory is much smaller than the relation size, a large
amount of disk access is required, which leads to poor performance [Lu87].

The  strategy  suggested  here  combines  the  iterative  methods  with  the  recursive  algo¬ 
rithm.  For  each  pair  of  buckets  which  can  be  held  in  main  memory,  all  tuples  in  the 
transitive  closure  derivable  from  them  are  generated.  These  tuples  are  output  either  to 
the  corresponding  buckets  for  further  processing  or  to  the  final  result  relation. 

Algorithm PROCESSING in figure 5.10 describes the processing of the $i$th bucket pair
in the $k$th iteration using this strategy. $\Delta R_i^k$ contains the tuples generated
in iteration $k-1$ and is hashed on the second attribute. $R_{0i}^k$ is the corresponding
bucket partitioned on the first attribute. $R_i$ is the $i$th bucket of the result relation
$R$, the transitive closure of $R_0$. Function GetBucketNo() returns the bucket number a
tuple belongs to when hashing on the second attribute. The algorithm works as follows:
for each tuple $t(a,b)$ in $\Delta R_i^k$, it finds all matching tuples from $R_{0i}^k$.
New tuples are formed and hashed on the second attribute to find the buckets to which they
belong. The tuples falling into the current bucket are used to probe the hash table
further. Tuples of other buckets are output to the corresponding buckets. They are
either processed in the same iteration (if the bucket has not been processed yet) or
processed in the next iteration. For each tuple, the processing terminates when cyclic
data is encountered (a tuple $t(a,a)$ is obtained) or no more matching tuples can be
found in $R_{0i}^k$.

We use a simple example to explain the algorithm. Relation $R_0$ shown in figure
5.11 consists of 8 tuples. Because of the limitation of memory size, they are partitioned
into two pairs of buckets, $(R_{01}^b, R_{01}^a)$ and $(R_{02}^b, R_{02}^a)$, on attributes
$b$ and $a$, respectively. A tuple $t(a,b) \in R_{01}^b$ iff hash($t.b$) in {1, 2, 3}, and
$t(a,b) \in R_{02}^b$ iff hash($t.b$) in {4, 5, 6}. Partitions $R_{01}^a$ and $R_{02}^a$
are formed in a similar way.
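
The partitioning step of the example can be sketched as follows. The eight tuples are
those read from figure 5.11; the helper names `bucket_no` and `partition`, and the
assumption that the hash function is the identity on the values 1 to 6, are illustrative
and not part of the report.

```python
# The eight tuples of R0 as listed in figure 5.11.
R0 = [(6, 1), (1, 2), (2, 3), (5, 3), (3, 4), (1, 5), (4, 6), (5, 6)]


def bucket_no(value):
    """Hypothetical stand-in for GetBucketNo: 1..3 -> bucket 1, 4..6 -> bucket 2."""
    return 1 if value <= 3 else 2


def partition(relation, attr):
    """Split a relation into two buckets on attribute position attr (0 = a, 1 = b)."""
    buckets = {1: [], 2: []}
    for t in relation:
        buckets[bucket_no(t[attr])].append(t)
    return buckets


Rb = partition(R0, 1)   # buckets on the second attribute b
Ra = partition(R0, 0)   # buckets on the first attribute a

for i in (1, 2):
    print("pair", i, "on b:", Rb[i], "on a:", Ra[i])
```

Every tuple in the first bucket on b ends at a node in {1, 2, 3}, and every tuple in the
first bucket on a starts at such a node, so all joins with `t.b = t'.a` for that pair can
be answered from the pair alone.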

The computation starts with the first pair of buckets, $\Delta R_1^0 = R_{01}^b$ and
$R_{01}^0 = R_{01}^a$. Algorithm PROCESSING is applied and the result tuples are hashed on
the second attribute. Those tuples with hash values in {4, 5, 6} (five of them in this
example) are appended to the bucket $\Delta R_2^0$ (as shown in the figure under the dotted
line). Other tuples (in this case, three) are output as the result. The second pair of
buckets is processed in a similar way. The difference is that the tuples generated with
the hash value of the second attribute in {1, 2, 3} are used to form $\Delta R_1^1$, which
is used in the next iteration.¹

Algorithm PROCESSING:

Input  : A pair of buckets, $\Delta R_i^k$, $R_{0i}^k$
Output : Tuples in the transitive closure of $R_0$, which are inserted
         into corresponding buckets

begin
    foreach tuple t in $\Delta R_i^k$ do
    probe:
        if there is a match tuple t' in $R_{0i}^k$ with t.B = t'.A
        then begin
            form a new tuple newt(t.A, t'.B);
            j = GetBucketNo(t'.B);
            if (j > i)
                then output newt to $\Delta R_j^k$;
            if (j < i)
                then output newt to $\Delta R_j^{k+1}$;
            if (j = i)
                then if (t.A $\ne$ t'.B)
                    then goto probe;
                    else output newt into $R_i$;
        end;
end;

Figure 5.10: Algorithm PROCESSING.

This strategy can be explained intuitively with the graphic representation of $R_0$ as
follows: The hashing technique partitions the directed graph $G_0$ into a number of
subgraphs $G_{0i}$. An edge $e$: $a \to b$ is in subgraph $G_{0i}$ iff $b$ is in bucket
$i$. For each edge $e \in G_{0i}$ ($a \to b$), algorithm PROCESSING finds all paths that
start from node $a$ and are contained in subgraph $G_{0i}$. If there is a path leading to a
node $c$ in another subgraph, $G_{0j}$, the output of a tuple $(a,c)$ to bucket
$\Delta R_j$ during the processing can be viewed as inserting a node $a$ and an edge
$a \to c$ in subgraph $G_{0j}$. Therefore, any path starting from node $a$ in subgraph
$G_{0i}$ and ending at another node $b$ in subgraph $G_{0j}$ can be found internally in
subgraph $G_{0j}$ later on.

¹ In the example we did not show the elimination of duplicates: duplicates in the
result tuples are eliminated before the next iteration, as when using the semi-naive
algorithm.

                 Iteration 1                        Iteration 2

          ΔR_1     R_01     R_1              ΔR_1     R_01     R_1
         (6, 1)   (1, 2)   (6, 2)           (3, 1)   (1, 2)   (3, 2)
         (1, 2)   (1, 5)   (6, 3)           (4, 1)   (1, 5)   (3, 3)
         (2, 3)   (2, 3)   (1, 3)           (5, 1)   (2, 3)   (4, 2)
         (5, 3)   (3, 4)                    (2, 1)   (3, 4)   (4, 3)
                                                              (4, 4)
                                                              (5, 2)
                                                              (2, 2)

          ΔR_2     R_02     R_2              ΔR_2     R_02     R_2
         (3, 4)   (4, 6)   (3, 6)           ......   (4, 6)
         (1, 5)   (5, 3)   (1, 6)           (3, 5)   (5, 3)
         (4, 6)   (5, 6)   (1, 1)           (4, 5)   (6, 1)
         (5, 6)   (6, 1)   (6, 6)           (5, 5)
         ......            (2, 6)           (2, 5)
         (6, 4)
         (6, 5)
         (1, 4)
         (2, 4)
         (5, 4)

Figure 5.11: An Example of Using Strategy 2.

The effectiveness of this strategy is clearly shown by the example in figure 5.11.
The longest path in the transitive closure includes five edges
($1 \to 2 \to 3 \to 4 \to 6 \to 1$), which requires five iterations for the semi-naive
algorithm and three iterations for the logarithmic algorithm. However, only two
iterations are needed using our strategy.

From the example, we can also see some savings other than the reduction of the
number of iterations. In the previous iterative algorithms, new tuples generated during
computation have to be read in at least once to join with the original relation. In our
strategy, the result tuples corresponding to the paths which do not cross the border of
subgraphs are not read in again. In the example, among 28 tuples generated in the
transitive closure (excluding the original tuples in $R_0$), only 12 tuples are written
out and then read back in for later processing.


5.6.3.  Algorithm  HYBRIDTC 

In this section, we describe a hash-based transitive closure algorithm. It integrates
the strategies described in the preceding sections. Since this algorithm combines the
merits of both iterative and recursive methods, we name it algorithm HYBRIDTC (a hybrid
transitive closure algorithm).

5.6.3.1.  The Algorithm


Algorithm HYBRIDTC;

Input  : relation $R_0$
Output : relation $R$, the transitive closure of relation $R_0$

begin
    partition $R_0$ on $R_0.A$ and $R_0.B$ into
        buckets $R_{0i}^a$ and $R_{0i}^b$ ($1 \le i \le N$);
    for i := 1 to N do begin
        $\Delta R_i^1$ := $R_{0i}^b$;
        $R_{0i}^1$ := $R_{0i}^a$;
    end;
    k := 0;
    repeat
        k := k + 1;
        for i := 1 to N do
            if ($\Delta R_i^k \ne \emptyset$) and ($R_{0i}^k \ne \emptyset$)
                then ProcessingBucket(i, k, $\Delta R_i^k$, $R_{0i}^k$);
                else $\Delta R_i$ := $\emptyset$;
        for i := 1 to N do begin
            $\Delta R_i$ := $\Delta R_i - R_i$;
        end;
    until all $\Delta R_i$s are empty;
    $R$ := $\bigcup_{1 \le i \le N} R_i$;
end.


Figure 5.12: Algorithm HYBRIDTC.

The algorithm is shown in figure 5.12. Relation $R_0$ is partitioned into two sets of
buckets on attributes $R_0.A$ and $R_0.B$ as in traditional hash joins. These two sets of
buckets are denoted by $R_{0i}^a$ and $R_{0i}^b$ ($1 \le i \le N$), respectively. We will
use subscripts to denote the bucket number and superscripts to denote the iteration
number. Let $\Delta R_i^k$ contain the new tuples in the transitive closure that belong to
bucket $i$ (hashed on attribute B) generated during the $(k-1)$th iteration, and
$R_{0i}^k$ be the reduced bucket $i$ of $R_0$ after $(k-1)$ iterations. The bucket pair
processed in the $k$th iteration is $\Delta R_i^k$ and $R_{0i}^k$, where
$\Delta R_i^1 = R_{0i}^b$ and $R_{0i}^1 = R_{0i}^a$.

After the relation is partitioned, the $\Delta R_i$s are initialized to be the
corresponding set of buckets. The processing of bucket pairs proceeds iteratively until
all $\Delta R_i^k$s are empty for the $k$th iteration. Since $\Delta R_i^k$ contains the
most recently generated tuples, and $R_{0i}^k$ is also reduced during each iteration,
procedure ProcessingBucket is only called when both of them are nonempty. During the
processing of bucket pair $\Delta R_i^k$ and $R_{0i}^k$, some result tuples are inserted
into $R_i$, and others are inserted into other buckets $\Delta R_j$ ($j \ne i$), as
described in algorithm PROCESSING. After each iteration $k$, duplicates are eliminated
from the $\Delta R_i$s which are going to be used in the next iteration.

procedure ProcessingBucket( bucketno, iteration : integer;
                            deltabucket, bucketR0 : buckets );
begin
    BuildHashTable(bucketR0);
    foreach tuple in deltabucket do
        ProcessingTuple(bucketno, iteration, tuple);
    foreach tuple in the hash table do
        if tuple.mark
            then OutputBucketR0(tuple, bucketno, iteration + 1);
end;

Figure 5.13: Procedure ProcessingBucket.


The union and duplicate elimination procedures are the same as in any transitive
closure algorithm, and we are not going to discuss them here. Figure 5.13 and figure
5.14 give one possible implementation of the procedures ProcessingBucket and
ProcessingTuple. In this implementation, a hash table is constructed for $R_{0i}^k$ as in
the traditional hash join algorithms. However, one extra field, "mark", is added to the
hash table entry. It is used to mark the tuples actually participating in the join.
Procedure ProcessingTuple is called for each tuple in deltabucket ($\Delta R_i^k$). After
all tuples have been processed, only the marked tuples are written back by calling
procedure OutputBucketR0 to form $R_{0i}^{k+1}$.

Procedure ProcessingTuple implements strategy 2 using a stack of tuples. PushStack,
PopStack, and EmptyStack are procedures and functions manipulating the stack.
The tuple on the top of the stack is used to look up the hash table to find matches.
The tuples formed from these matches can be divided into three categories according to
the bucket each belongs to. The bucket number of a tuple is returned by function
GetBucketNo. The tuples of other buckets are inserted into the $\Delta R_j$ buckets by
procedure OutputDelta. The tuples of the bucket currently being processed are pushed
onto the stack for later processing. This process continues until the stack is empty.

The advantage of using a stack is its simplicity. Another advantage, perhaps a
more important one, is ease of memory management. If a large number of tuples are
derived from some particular tuple in the bucket, so that the stack becomes full, we
can simply write the bottom part of the stack to disk to free memory space and read it
back in later to continue the process. Thus, algorithm HYBRIDTC does not introduce new
issues in memory management. Techniques of partitioning a relation into buckets and of
handling overflow buckets developed in hash join methods can be directly used.

procedure ProcessingTuple( bucketno, iteration : integer;
                           inputtuple : TupleType );
var currenttuple, matchtuple, newtuple : TupleType;
    newbucketno : integer;
begin
    PushStack(inputtuple);
    while (NOT EmptyStack) do
    begin
        currenttuple := PopStack;
        if (currenttuple.a <> currenttuple.b)
        then begin
            matchtuple := LookUp(currenttuple);
            foreach matchtuple do
            begin
                if (NOT matchtuple.mark)
                    then matchtuple.mark := true;
                newtuple := FormTuple(currenttuple.a, matchtuple.b);
                newbucketno := GetBucketNo(newtuple);
                if (newbucketno = bucketno)
                    then PushStack(newtuple);
                if (newbucketno < bucketno)
                    then OutputDelta(newtuple, newbucketno, iteration + 1);
                if (newbucketno > bucketno)
                    then OutputDelta(newtuple, newbucketno, iteration);
            end;
        end;
        OutputResult(bucketno, currenttuple);
    end;
end;  (* procedure ProcessingTuple *)

Figure 5.14: Procedure of Processing a Tuple in $\Delta R_i$.
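
The control flow of figures 5.12 to 5.14 can be sketched in memory as follows. This is an
illustration under stated assumptions, not the report's implementation: the names
`hybrid_tc`, `bucket_of`, and `known` are hypothetical, buckets are numbered from 0,
duplicates are eliminated eagerly through the `known` set rather than after each
iteration, and the reduction of the $R_0$ buckets through the mark field is left out to
keep the sketch short.

```python
from collections import defaultdict


def hybrid_tc(r0, bucket_of, n_buckets):
    """Bucket-wise transitive closure of r0, a set of (a, b) pairs.

    bucket_of maps an attribute value to a bucket number in range(n_buckets).
    Returns (closure, iterations).
    """
    # R0 partitioned on the first attribute: source[i][a] = {b, ...}.
    source = [defaultdict(set) for _ in range(n_buckets)]
    # Delta buckets partitioned on the second attribute.
    delta = [set() for _ in range(n_buckets)]
    for a, b in r0:
        source[bucket_of(a)][a].add(b)
        delta[bucket_of(b)].add((a, b))

    known = set(r0)          # every tuple generated so far (assumed to fit in memory)
    iterations = 0
    while any(delta):
        iterations += 1
        next_delta = [set() for _ in range(n_buckets)]
        for i in range(n_buckets):
            stack = list(delta[i])           # includes tuples added earlier in this pass
            while stack:
                a, b = stack.pop()
                if a == b:                   # cyclic tuple: do not extend it further
                    continue
                for c in source[i].get(b, ()):
                    new = (a, c)
                    if new in known:         # duplicate: already generated elsewhere
                        continue
                    known.add(new)
                    j = bucket_of(c)
                    if j == i:
                        stack.append(new)        # same bucket: keep probing now
                    elif j > i:
                        delta[j].add(new)        # later bucket, same iteration
                    else:
                        next_delta[j].add(new)   # earlier bucket, next iteration
        delta = next_delta
    return known, iterations


# The relation of figure 5.11, with values 1..3 in bucket 0 and 4..6 in bucket 1.
R0 = {(6, 1), (1, 2), (2, 3), (5, 3), (3, 4), (1, 5), (4, 6), (5, 6)}
closure, iterations = hybrid_tc(R0, lambda v: 0 if v <= 3 else 1, 2)
print(len(closure), iterations)
```

On this input the sketch finishes after two passes over the bucket pairs and produces all
36 tuples of the closure, which agrees with the two iterations reported for the example.
The stack plays the role of PushStack and PopStack, and the three branches on `j`
correspond to the three OutputDelta and PushStack cases of figure 5.14.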

Now,  we  prove  the  following  Lemma: 

Lemma:  Algorithm  HYBRIDTC  correctly  computes  the  transitive  closure  of  a  database 
relation. 

Proof: The proof of the Lemma consists of two parts. First, we have already explained
in Section 5.6.1 that the removal of unmarked tuples, the tuples not participating in the
join in the current iteration, will not lead to loss of the result tuples. Second, we
prove that the algorithm will find all tuples in the transitive closure. In other words,
the algorithm can find all paths in graph $G_0$ if relation $R_0$ is represented by $G_0$.
Let $p$ be a path of graph $G_0$. It is obvious that, if all nodes on path $p$ are
contained in one subgraph of $G_0$, the path can be found by the algorithm when the
corresponding buckets are processed. It is more likely that paths cross over the border of
subgraphs. Let $e$ ($a \to b$) $\in G_0$ be an edge whose end nodes $a$ and $b$ are in two
different subgraphs, $G_{0i}$ and $G_{0j}$, respectively. Then tuple $(a,b)$ is in bucket
$j$. During processing of bucket $j$, all paths starting from $b$ and ending at some nodes
$y_l$ in $G_{0j}$ can be found, and a set of tuples $\{(b, y_1), \ldots, (b, y_l), \ldots\}$
is generated. If there are some paths starting from some node $x_m$ and ending at node
$a$, the processing of bucket $i$ will not only generate a set of tuples $\{(x_m, a)\}$,
but also generate a set of tuples $\{(x_m, b)\}$. They are inserted into bucket $j$. Thus,
in the next iteration of processing bucket $j$, all paths starting from node $x_m$ and
ending at node $y_l$ can be found. The proof can be extended to the paths across any
number of subgraphs. □

5. 6. 3. 2.  Performance  Comparisons 

Qualitatively, algorithm HYBRIDTC is expected to improve performance in the
following ways:

(1)  Reduce  the  number  of  iterations. 

For the semi-naive and logarithmic algorithms, only paths with certain lengths can
be found in each iteration. The number of iterations needed to complete the
computation is determined by the depth of the transitive closure, that is, the longest
path. For algorithm HYBRIDTC, paths contained in a subgraph can be generated
in a single iteration no matter how long they are. Furthermore, the buckets processed
later make use of the new tuples generated by the buckets which have already been
processed in the same iteration. As a result, the number of iterations needed largely
depends on how the relation is partitioned and is usually less than the depth of the
transitive closure. The reduction in the number of iterations at least reduces the
disk I/O needed to read in $R_0$ and the CPU time for constructing the hash tables.

(2)  Reduce  the  number  of  disk  I/Os  needed  to  read  in  the  delta  relations. 

For  both  the  semi-naive  and  logarithmic  algorithms,  the  result  tuples  generated  in 
one  iteration  have  to  be  written  to  the  disk  and  read  in  again  in  the  next  iteration. 
However,  in  algorithm  HYBRIDTC,  the  tuples  generated  in  one  iteration  need  to 
be  read  in  again  only  if  they  belong  to  other  buckets.  Again,  the  extent  of  this 
savings  largely  depends  on  the  data  distribution  and  the  partitions. 

(3)  Reduce  the  size  of  the  source  relation. 

The  source  relation  used  to  compute  the  transitive  closure  is  dynamically  reduced 
during  processing,  compared  to  the  constant  size  in  the  semi-naive  algorithm  and 
no  optimization  in  the  logarithmic  algorithm. 

Any quantitative analysis of algorithm HYBRIDTC is difficult, since the performance
will vary dramatically with different data characteristics and the partitioning. In
order to validate our qualitative analysis above, we made some comparisons between the
performance of the semi-naive algorithm, the logarithmic algorithm, and algorithm
HYBRIDTC as follows:


(1)  The  data  model  proposed  by  Bancilhon  and  Ramakrishnan  [Banc86]  is  used.  We 
examined  two  simple  cases,  lists  and  trees  having  fanout  2. 

(2)  We  use  the  number  of  tuples  read  in  during  the  computation  as  the  performance 
measure  for  the  comparison.  This  number  roughly  reflects  the  total  costs  of  the 
computation.  The  larger  the  number  is,  the  more  disk  I/O  cost  and  CPU  cost  for 
constructing  the  hash  tables.  Furthermore,  we  assume  that  duplication  elimination 
costs  are  the  same  for  all  three  algorithms,  and  they  are  not  taken  into  account. 

(3)  Some of the implementation details are ignored. For example, for the semi-naive
algorithm and the logarithmic algorithm, we only calculate the total number of
tuples of the two relations joined in each iteration. This number is therefore
independent of the memory size and the number of hash buckets. We actually assume
that the pipeline method is used to reduce the number of disk I/Os [Lu87]. That
is, each tuple in the transitive closure only counts once: no separate partition
phase is assumed.


With  the  above  assumptions,  the  total  number  of  tuples  for  the  semi-naive  and  the 
logarithmic  algorithms  are  calculated  as  follows: 

For the semi-naive algorithm, $h$ iterations are needed to generate all tuples in the
transitive closure. One more iteration is actually completed, resulting in the
termination of the computation. During each iteration, there is only one join. The total
number of tuples participating in the join operations is:

$$N_{\text{semi-naive}} = |R| + (h+1)\,|R_0|$$
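
As a quick check of this count, writing $N_{\text{semi-naive}}$ for the total above,
consider a list of $n$ tuples. The worked example below rests on assumptions that are not
stated in the text: the list is a simple chain, so that $h = n$ and no duplicates arise,
and $|R|$ counts the tuples of $R_0$ as well.

```latex
\begin{align*}
|R_0| &= n, \qquad |R| = \sum_{j=1}^{n} (n - j + 1) = \frac{n(n+1)}{2}, \\
N_{\text{semi-naive}} &= \frac{n(n+1)}{2} + (n+1)\,n = \frac{3\,n\,(n+1)}{2}, \\
n = 100:\quad N_{\text{semi-naive}} &= 5050 + 10100 = 15150 .
\end{align*}
```

Under these assumptions, two thirds of the tuples read are repeated scans of $R_0$, which
is why reducing the number of iterations matters so much for long lists.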


The number of iterations needed in the logarithmic algorithm, $k$, is determined by
$k = \lg(h+1) - 1$. For each iteration $i$, there are two joins: the join of
$R^{2^{i-1}}$ with $R^{2^{i-1}}$, and the join of the result tuples in the transitive
closure so far, which is $\sum_{j=1}^{2^i - 1} R^j$, with the newly generated relation
$R^{2^i}$. The total number of tuples participating in the computation is:

$$N_{\text{logarithmic}} = \sum_{i=1}^{k} \Bigl( 2\,|R^{2^{i-1}}| + \sum_{j=1}^{2^{i}} |R^j| \Bigr)$$


The  number  of  tuples  read  in  algorithm  HYBRIDTC  is  obtained  by  simulation:  a 
program  was  coded  to  implement  the  algorithm  in  memory.  A  random  number  genera¬ 
tor  was  used  to  assign  bucket  numbers  for  tuples.  The  corresponding  buckets  were  then 
joined  iteratively  to  compute  the  transitive  closure.  When  each  bucket  pair  was  pro¬ 
cessed,  the  number  of  tuples  in  the  buckets  was  counted.  The  total  number  of  tuples 
read in could thus be obtained. In the simulation, we used a small bucket size
(typically each bucket contains 10 tuples). Therefore the simulation actually does not
favor algorithm HYBRIDTC.

Figure 5.16: The Performance Comparison 2 ($R_0$: Tree).

The result of this comparison is shown in figures 5.15 and 5.16. The lengths of the
lists vary from 100 to 1024. The tree depth varies from 4 to 12. The comparison uses
the number of tuples in the semi-naive algorithm as a reference. The ratios
logarithmic/semi-naive and hybrid/semi-naive are computed. The results in the figures
show that algorithm HYBRIDTC consistently outperforms the other two algorithms.
For lists, the ratio hybrid/semi-naive is about 50 percent, whereas the ratio
logarithmic/semi-naive is about 60 to 70 percent. This result is expected, as we
discussed above.

In both figures, the ratio of logarithmic to semi-naive is not monotonic. Sometimes,
the semi-naive algorithm even outperforms the logarithmic algorithm. This happens
when the depth is just larger than $2^k$. This is also observed by Ioannidis [Ioan86].
The explanation is that the number of iterations of the logarithmic algorithm is
determined by the depth. When the depth increases past $2^k$, the number of iterations
increases by 1. That is, another iteration is required to complete the computation to
find just a few more tuples. That is one disadvantage of the logarithmic algorithm.


Figure 5.17: The Performance versus Number of Buckets ($R_0$: List).

We did not compare the performance of algorithm HYBRIDTC with the recursive
algorithm. The performance of the recursive algorithm becomes much worse than that of
the logarithmic algorithm when the memory size is small compared with the relation size
[Lu87]. However, algorithm HYBRIDTC still performs better than the other two
algorithms, even in this case. Figure 5.17 illustrates the number of tuples read from
disk for different numbers of buckets into which the relation is partitioned. When we
increase the number of buckets, which simulates smaller and smaller bucket sizes, the
number of tuples read from disk also increases. However, it is still less than what is
needed in the other two algorithms.

5.6.4.  Conclusions 

We have discussed two strategies which optimize the computation of the transitive
closure of a database relation. We also presented a hash-based algorithm that
integrates these two strategies. The algorithm is easy to implement in real systems
by modifying the traditional hash join methods. A simple performance analysis
was conducted, and the results indicate that the new algorithm does outperform
previous algorithms. This performance analysis is far from complete. However, it does
provide evidence that our new optimization strategies are in the right direction.
Further detailed implementation in relational database systems and performance
analysis is one of the possible projects for future work.

Besides better performance, the algorithm has some other advantages. For example,
the algorithm is easy to extend to become a distributed or parallel algorithm. In
algorithm HYBRIDTC, there is no inherent sequence among the iterations. For other
algorithms, the result of an iteration is used as the input of the next iteration. In the
logarithmic algorithm, the second join in each iteration can only be started after the
first join finishes. For the distributed version of algorithm HYBRIDTC, each processor
or node can work on one or more pairs of buckets. The tuples generated at one
processor are either processed locally or sent to other processors. The only
synchronization needed is the final termination of the whole computation.

This algorithm can be further optimized along the directions proposed. One
possibility is as follows: the new tuples generated are not only hashed on the second
attribute and inserted into the corresponding buckets, but also hashed on the first
attribute and inserted into the second relation in the join ($R_{0i}$). Thus, more tuples
can be generated in each iteration, and performance improvement can be expected.
However, it is somewhat difficult to implement in a real system since the size of $R_0$
will change during processing. Some sophisticated memory management strategy and bucket
overflow techniques have to be developed.

Algorithm HYBRIDTC is a basic algorithm for computing the simple transitive closure
of a database relation. Interesting future work is to use it as a base for extending
a relational database management system to include transitive closure as one basic
operation. To achieve this, the algorithm should be further augmented so that more
complicated transitive closure queries can be processed efficiently [Agra87].


CHAPTER  6

Parallel  Architectures  for  Database  Management 


As enterprises use database management systems to manage more of their information,
the size of existing databases is increasing rapidly. Databases of over 100 gigabytes
now exist; terabyte databases, if they do not exist now, will appear in the next
few years. Managing these large databases will require more powerful architectures
than are in common use today. Technology now permits the construction of
multiprocessor database management architectures with tens or hundreds of processors, a
gigabyte or more of main memory, and disk capacity in the terabyte range. The Teradata
DBC/1012 is an example of such an architecture [83].

The  join  operation  is  an  important  operation  for  relational  database  systems,  and 
will  become  even  more  important  as  logic-based  inference  capabilities  are  added  to  these 
systems.  In  this  chapter,  we  describe  a  number  of  multiprocessor  join  algorithms.  The 
algorithms  use  sort-merge  and  hashing  techniques,  and  are  highly  parallel  and  pipelined. 
The  algorithms  are  designed  to  execute  on  a  multiprocessor  architecture  that  is 
parameterized  in  the  degree  of  memory  sharing,  so  that  tightly  coupled,  loosely  coupled, 
and  intermediate  architectures  can  be  modeled.  Other  architectural  parameters  include 
the  number  of  processors,  number  of  disks,  amount  of  main  memory,  and  interconnec¬ 
tion  network  bandwidth.  We  model  the  performance  of  the  algorithms  analytically  to 
determine  elapsed  time,  resource  utilization,  and  other  quantities  as  functions  of  the 
workload  and  architectural  parameters.  The  join  algorithms  overlap  computation,  disk 
transfers,  and  interconnection  network  transfers.  The  analysis  models  this  overlap  and 
identifies  bottlenecks  that  limit  the  algorithms’  performance.  We  do  not  model  multiple 
simultaneous  join  operations,  and  therefore  do  not  compute  system  throughput.  Based 
on  this  analysis,  we  answer  the  following  questions: 

•  How  do  the  algorithms  compare  in  performance?  When  does  one  outperform 
another? 

•  How  does  response  time  vary  as  a  function  of  the  architectural  parameters? 

•  How  does  response  time  vary  with  the  workload? 

•  Does  shared  memory  help  algorithm  performance?  To  what  extent? 

•  What  are  the  architectural  bottlenecks?  How  could  they  be  alleviated? 

In the following sections, we describe the multiprocessor hardware architecture and
the join algorithms, develop cost formulas for the algorithms, compare the algorithms'
performance under various workloads and hardware configurations, and summarize
the results of our investigation.

6.1.  Multiprocessor  Data  Management  Architecture

Many  specialized  architectures  have  been  proposed  for  high  performance  relational 
database  management.  These  architectures  include  logic-on-disk  machines 
[Schu79,  Su79],  VLSI-based  special  purpose  processors  [Kits83,  Shib84],  and  loosely-  and 
tightly-coupled  multiprocessor  architectures  [DeWi79, 83,DeWi86].  We  believe  that 
commercially  viable  database  machines  must  be  constructed  principally  from  commo¬ 
dity  components  such  as  general  purpose  microprocessors  and  conventional  disk  storage 
devices.  This  belief  is  based  on  the  superior  price/performance  and  reliability  of  com¬ 
modity  components  compared  to  custom  components.  Therefore,  we  consider  in  this 
study  a  multiprocessor  architecture  with  the  following  characteristics: 

•  The  architecture  uses  a  large  number  (tens  to  hundreds,  at  least)  of  processors  to 
obtain  the  necessary  performance.  This  assumes  that  the  processors  can  be  used 
effectively.  The  Teradata  DBC/1012  appears  to  have  demonstrated  that  this  is 
possible. 

•  The  architecture  can  use  large  amounts  (hundreds  of  megabytes  to  hundreds  of 
gigabytes)  of  semiconductor  memory.  In  the  next  few  years,  this  amount  of 
memory  will  be  feasible  as  well  as  cost-effective. 

•  The  architecture  can  support  an  aggregate  disk  capacity  of  a  terabyte  or  more; 
only  a  small  fraction  of  the  total  database  can  be  accommodated  in  main  memory. 
We  assume  further  that  many  of  the  individual  database  relations  will  typically  not 
fit  in  main  memory. 

Figure  6.1  shows  a  block  diagram  of  our  architecture.  The  architecture  consists  of  a  set 
of  clusters  linked  by  an  intercluster  bus  or  ring.  Each  cluster  consists  of  a  set  of  proces¬ 
sors,  a  shared  memory  bank  addressable  by  all  the  processors  in  the  cluster,  and  a  set  of 
disk  storage  units  and  associated  controllers.  Processors  read  and  write  the  shared 
memory  in  units  of  a  few  bytes,  with  little  contention.  The  processors  may  have  local 
caches  to  reduce  memory  contention,  but  this  is  invisible  to  the  data  management 
software  except  for  possibly  the  need  to  flush  the  cache  occasionally.  Transfers  between 
disk  and  memory,  and  between  cluster  memories  over  the  bus,  are  a  page  at  a  time, 
where a page is a few kilobytes or more in size. A specific configuration of this architecture
is determined by the following parameters:

NC  number  of  clusters 

NP  number  of  processors  per  cluster 

ND  number  of  disks  per  cluster 

M  pages  of  main  memory  available  per  cluster 

PG  page  size  in  bytes 

These parameters can be varied to determine the effect of architectural changes. For
instance, if the CPU is a bottleneck, more CPUs can be added per cluster or each CPU
can be made faster. (CPU speed is defined in terms of execution times for basic operations
associated with the join algorithms, such as tuple move.) If the disk is a
bottleneck, the page size can be increased or more disks can be added per cluster. We
have assumed that the network is a single bus, so the only architectural cure for a
network bottleneck is to increase the network transmission rate.

Figure 6.1. Multiprocessor Data Management Architecture
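A configuration of this architecture can be written down as a small record. The sketch below is only illustrative: the field names follow the parameter list above, while the class name, the helper methods, and the numeric values are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchConfig:
    """One configuration of the clustered architecture (names follow the text)."""
    NC: int  # number of clusters
    NP: int  # number of processors per cluster
    ND: int  # number of disks per cluster
    M: int   # pages of main memory available per cluster
    PG: int  # page size in bytes

    def total_processors(self) -> int:
        return self.NC * self.NP

    def total_disks(self) -> int:
        return self.NC * self.ND

    def memory_bytes_per_cluster(self) -> int:
        return self.M * self.PG


# Hypothetical values, chosen only for illustration.
base = ArchConfig(NC=8, NP=4, ND=2, M=4096, PG=8192)
# "Cure" for a CPU bottleneck: more processors per cluster, all else equal.
more_cpus = ArchConfig(NC=8, NP=8, ND=2, M=4096, PG=8192)
print(base.total_processors(), more_cpus.total_processors())
```

Varying one field at a time in this way mirrors how the analysis explores architectural changes.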


6.2.  Join  Algorithm  Descriptions 

The problem that each join algorithm solves is the following: given relations R and
S, compute their natural join on an unspecified pair of attributes, giving output relation
O. R and S are assumed to be uniformly partitioned across all disks on all clusters.
The partition is not determined by the values of the join attributes, so that tuples
must be transmitted between clusters to perform the join. Let R_i and S_i denote the
fragments of R and S, respectively, stored on the disks at cluster C_i. The result of the
join can be partitioned across the clusters; it need not be collected on one cluster. No
projection is performed on the result except to remove the redundant copy of the join
attribute, so no duplicate tuples are produced.

Assume without loss of generality that S is larger (in bytes) than R. Some of the
algorithms  transmit  only  one  of  the  relations  over  the  interconnection  network;  they 
transmit  R,  the  smaller  relation. 

Each algorithm consists of one or more phases. The phases are executed one after
another; the activity in one phase completes before the next phase starts. Within each
phase, a fixed set of processes execute in parallel, passing data to each other (possibly
over the network) and reading from and writing to disk in pipelined fashion. Each cluster
has dedicated send and receive processes to act as intermediaries between communicating
processes on different clusters.

Processes communicate with each other via streams of data pages. Each process
may have several input and output streams. We chose this granularity of communication
to minimize the interprocess communication and synchronization overhead. The
cost of passing a page of data between processes on the same cluster is assumed to be
negligible in comparison to the cost of producing or consuming it. In some algorithms,
processes on the same cluster read concurrently from the same page buffer or other
memory area; concurrent reading and writing is not used because it would require a
high synchronization overhead. The interprocess communication system uses flow control
to match the speeds of the producing and consuming processes and to prevent
buffer overflow. Enough buffers are allocated to allow all processes to execute concurrently.
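The page-stream model with flow control can be illustrated with a bounded queue between two threads. This is a minimal sketch under stated assumptions, not the report's implementation: threads stand in for processes, a page is a list of integers, and the names `producer`, `consumer`, and `run_pipeline` are invented for the example.

```python
import queue
import threading

END = None  # sentinel marking the end of a stream


def producer(pages, out_stream):
    for page in pages:
        out_stream.put(page)  # blocks while all stream buffers are full
    out_stream.put(END)


def consumer(in_stream, results):
    while True:
        page = in_stream.get()  # blocks while the stream is empty
        if page is END:
            break
        results.append(sum(page))  # stand-in for real per-page work


def run_pipeline(pages, buffers=2):
    """Connect one producer to one consumer through a bounded page stream."""
    stream = queue.Queue(maxsize=buffers)
    results = []
    threads = [
        threading.Thread(target=producer, args=(pages, stream)),
        threading.Thread(target=consumer, args=(stream, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The bounded queue is what provides the flow control: a fast producer is stalled by `put` once the buffers are full, so it cannot overrun a slow consumer.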

We present six algorithms here. The algorithms come in pairs: the first of each
pair transmits tuples from both R and S over the network, while the second transmits
only tuples from R, reducing communications cost while increasing computation. The
two algorithms comprising the first pair are parallel versions of the basic sort-merge
join; they are most similar to algorithms described in [Bitt83, Vald84]. The other four
use hashing to decompose R and S into buckets, and then use either a sort-merge or a
hashing technique to join each pair of buckets. They are similar to algorithms
described in [Kits83, DeWi85].


6.2.1.  Parallel  Sort  Merge  Algorithms 


6.2.1.1. Parallel Sort-Merge Join Type 1 (SMJ1)

In  the  basic  sort-merge  join,  each  relation  is  first  sorted  on  its  join  attribute. 
Then,  the  two  sorted  relations  are  merge-joined.  The  merge-join  operation  matches 
tuples  in  the  two  relations  by  their  join  attributes,  and  generates  the  result  tuples.  It  is 
pipelined  much  like  a  merge  operation,  except  that  a  tuple  in  either  input  relation  can 
be  used  to  construct  multiple  output  tuples. 
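The merge-join step can be sketched as follows. This is an illustrative, single-process version that assumes each relation is a list of `(key, payload)` tuples already sorted by key; the function name and tuple layout are assumptions made for the example.

```python
def merge_join(r_sorted, s_sorted):
    """Join two lists of (key, payload) tuples that are sorted by key."""
    out = []
    i = j = 0
    while i < len(r_sorted) and j < len(s_sorted):
        rk, sk = r_sorted[i][0], s_sorted[j][0]
        if rk < sk:
            i += 1
        elif rk > sk:
            j += 1
        else:
            # Find the group of tuples sharing this key on each side.
            i_end = i
            while i_end < len(r_sorted) and r_sorted[i_end][0] == rk:
                i_end += 1
            j_end = j
            while j_end < len(s_sorted) and s_sorted[j_end][0] == rk:
                j_end += 1
            # Each tuple in one group pairs with every tuple in the other,
            # which is why one input tuple can yield several output tuples.
            for _, r_val in r_sorted[i:i_end]:
                for _, s_val in s_sorted[j:j_end]:
                    out.append((rk, r_val, s_val))
            i, j = i_end, j_end
    return out
```

The handling of equal-key groups is the only difference from a plain merge of two sorted lists.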

Previously published work has presented parallel algorithms for the sort
phase. Bitton et al. describe join algorithms employing a parallel binary merge sort
and a block bitonic sort [Bitt83]. The former can be improved, memory permitting, by
using a general multi-way merge [Vald84]. Both algorithms start by generating a set of
sorted runs from the original unsorted relation. These runs are generated using a
main-memory sorting algorithm or a priority queue. The latter is preferable because it
generates runs that are on average twice the size of the main memory dedicated to the
priority queue, and hence twice the size of the runs generated by a main-memory sorting
algorithm [Knut73]. In addition, run generation by priority queue is inherently a
pipelined operation, permitting better overlap between CPU and I/O than run generation
by main-memory sorting. Once the runs are generated, the final sorted output is
generated either by merging the runs or by using a block bitonic algorithm.
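Run generation by priority queue is commonly known as replacement selection. The sketch below shows one way it might look, assuming records are directly comparable values and that `capacity` (a name chosen for this example) is the number of records the queue can hold.

```python
import heapq

_DONE = object()  # marks exhaustion of the input


def generate_runs(records, capacity):
    """Split `records` into sorted runs using a priority queue of `capacity` items."""
    it = iter(records)
    heap = []
    for rec in it:
        heap.append((0, rec))  # (run number, record)
        if len(heap) == capacity:
            break
    heapq.heapify(heap)

    runs, run, current = [], [], 0
    while heap:
        run_no, rec = heapq.heappop(heap)
        if run_no != current:  # the queue now holds only next-run records
            runs.append(run)
            run, current = [], run_no
        run.append(rec)
        nxt = next(it, _DONE)
        if nxt is not _DONE:
            # A record smaller than the one just written cannot extend the
            # current run, so it is tagged for the next run instead.
            tag = run_no if nxt >= rec else run_no + 1
            heapq.heappush(heap, (tag, nxt))
    if run:
        runs.append(run)
    return runs
```

Because each record written out is immediately replaced by a new input record, reading and writing proceed in step, which is the pipelining property noted above.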

The sort-merge join algorithm just described parallelizes the sort operations, but
the final merge-join operation is still performed sequentially over the entire length of
both relations. In the algorithm described below, the final merge-join is partitioned into
multiple parallel processes so that no single process must pass over all of either relation.
This is accomplished by generating NFRUN_R final runs of relation R and NFRUN_S
final runs of relation S, and merge-joining each of the final runs of R with each of the
final runs of S, all in parallel. One of these merge-joins is executed at each cluster,
leaving the joined relation O partitioned across the clusters. There is an obvious constraint:

$$NFRUN_R \cdot NFRUN_S = NC$$

For convenience in describing the algorithm, let the clusters be named C_ij for
1 ≤ i ≤ NFRUN_R and 1 ≤ j ≤ NFRUN_S. The clusters are logically arranged as a
two-dimensional array with NFRUN_R rows and NFRUN_S columns, though their physical
interconnection is unchanged. (Assume for now that there are at least two rows and
two columns. The degenerate case is discussed below.) The portion of R stored on C_ij
is called R_ij.

The  algorithm  has  three  major  phases. 

Phase 1: At each cluster C_ij, NP processes generate initial runs of R_ij and write them
back to disk. (NP is the number of processors per cluster.) Each process has its own
priority queue for generating the runs.

Phase 2: Each cluster C_ij generates initial runs of S_ij in the same way.

Phase 3: Merge the initial runs of R into NFRUN_R final runs, and the initial runs of
S into NFRUN_S final runs. Also, merge-join each final run of R with each final run of
S to produce the result. The merging and merge-joining form a five-stage pipeline.
The final runs of R are produced in two stages. The first stage of the merge occurs at
each cluster C_ij, where a merge process merges all initial runs of R_ij into a single sorted
version of R_ij. One of the clusters in each row, the row pivot cluster, executes the
second stage, merging the sorted R_ij's into a final sorted run for the row. The final
runs of S are generated in the same way except that a column pivot cluster executes the
second merge stage for clusters in its column. The row (column) pivot clusters send
their final runs to the other clusters in the row (column); each cluster merge-joins the
final run of R for row i with the final run of S for column j to produce O_ij, a fragment
of the final result. By choosing C_ii as the row pivot cluster for row i, and C_{j⊕1,j}
as the column pivot cluster of column j, no cluster is both a row and a column pivot.
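The row and column structure of this algorithm can be checked with a small single-process sketch. It is illustrative only: indices are 0-based, `sorted` stands in for the two merge stages, `join_sorted` is a simplified stand-in for a true merge-join, and all function names are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def join_sorted(r_sorted, s_sorted):
    """Stand-in for merge-join: pair up tuples of two key-sorted lists."""
    s_groups = {k: [v for _, v in g]
                for k, g in groupby(s_sorted, key=itemgetter(0))}
    return [(k, rv, sv) for k, rv in r_sorted for sv in s_groups.get(k, [])]


def smj1(r_frag, s_frag, n_rows, n_cols):
    """r_frag[(i, j)] and s_frag[(i, j)]: (key, payload) tuples stored at C_ij."""
    # Final run of R for row i: all R fragments in that row, in sorted order.
    r_final = {i: sorted(t for j in range(n_cols) for t in r_frag[(i, j)])
               for i in range(n_rows)}
    # Final run of S for column j: all S fragments in that column, sorted.
    s_final = {j: sorted(t for i in range(n_rows) for t in s_frag[(i, j)])
               for j in range(n_cols)}
    # Cluster C_ij joins R's final run for row i with S's for column j.
    return {(i, j): join_sorted(r_final[i], s_final[j])
            for i in range(n_rows) for j in range(n_cols)}
```

Every R tuple belongs to exactly one row and every S tuple to exactly one column, so each matching pair meets at exactly one cluster and the union of the fragments is the complete join.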

6.2.1.2. Parallel Sort-Merge Join Type 2 (SMJ2)

It  may  be  wasteful  to  sort  both  relations  completely  before  merge-joining  them, 
especially  if  the  join  selectivity  is  low.  At  each  stage  of  the  merging  process,  tuples  are 
being  processed  that  may  not  participate  in  the  final  join.  The  following  algorithm 
makes only one pass over the larger relation (S). In the description, we revert to single
subscripts on clusters and relation fragments.

The  algorithm  has  three  phases. 

Phase 1: Generate local runs of R_i at each cluster C_i, as in algorithm SMJ1.

Phase 2: Merge these runs into a single sorted version of R. This is done with a two-stage
merge. Each cluster executes the first stage, merging the local runs into a sorted
version of R_i. The results from each cluster are sent to C_1, which executes the second
stage of the merge and broadcasts the resulting sorted R to all clusters. Each cluster
writes this run to disk.

Phase  3:  Generate  runs  of  S  at  each  cluster  using  a  priority  queue.  However,  instead 
of writing these runs to disk, merge-join them immediately with R to produce the output
tuples. Let there be NP processes at each cluster executing the run generation
stage,  paired  with  an  equal  number  of  processes  executing  the  merge-join  stage. 

The advantage of this algorithm is that it produces the join results with one pass
over the S relation. When S is large, this presumably saves much of the I/O and processing
that would otherwise be required to produce the final runs of S. The disadvantage
is that R must be read repeatedly from disk to be joined against the runs of S. (If
R fits entirely in main memory, this is unnecessary. However, hash-based algorithms
may be superior in this case. We do not attempt to fit all of R in main memory.)

6.2.2.  Hash  Partitioning  Join  Algorithms 

These algorithms all use a hash partitioning technique described in [DeWi84] to
decompose a join of two large relations into a sequence of smaller joins. They partition
tuples of R and S into batches RB_0, ..., RB_NBATCH and SB_0, ..., SB_NBATCH and
join the respective batches. The partitioning is based on the value of a hash function
applied to the join attribute, so that joining the respective batches generates all
required result tuples. The batches are sized so that each batch join can be performed
in main memory. Batches 1, ..., NBATCH of each relation are written to disk during
partitioning. Batches RB_0 and SB_0 are joined during or immediately after partitioning,
depending on the particular algorithm. If sufficient memory is available, NBATCH = 0
and no batches must be written to disk. Otherwise, batches 1, ..., NBATCH are read
from disk and joined one after the other. The number of batches can be computed
from the size of the relations and available memory; see Section 6.3. This partitioning
technique minimizes the amount of intermediate data that must be written to disk.
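The batch decomposition can be sketched in a few lines. This is a single-process illustration under assumptions: Python's built-in `hash` stands in for the hash function, lists stand in for the disk files of batches 1 through NBATCH, and a nested loop stands in for whichever in-memory join is applied to each batch pair.

```python
def batch_of(key, nbatch):
    """Batch number in 0..nbatch for a join-attribute value."""
    return hash(key) % (nbatch + 1)


def partitioned_join(r, s, nbatch):
    """Join lists of (key, payload) tuples one batch pair at a time."""
    r_batches = [[] for _ in range(nbatch + 1)]
    s_batches = [[] for _ in range(nbatch + 1)]
    for t in r:
        r_batches[batch_of(t[0], nbatch)].append(t)
    for t in s:
        s_batches[batch_of(t[0], nbatch)].append(t)

    out = []
    # Tuples with equal keys hash to the same batch number, so only
    # RB_j and SB_j ever need to be compared with each other.
    for rb, sb in zip(r_batches, s_batches):
        out.extend((rk, rv, sv) for rk, rv in rb for sk, sv in sb if rk == sk)
    return out
```

With `nbatch=0` everything falls into batch 0 and the join proceeds without any intermediate files, matching the case where sufficient memory is available.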

All of the algorithms partition the batches further into buckets and join the buckets
within each batch in parallel. Two of the algorithms, HSM1 and HSM2, join respective
buckets using a sort-merge technique similar to the GRACE algorithm [Kits83].
The other two, HH1 and HH2, use the in-memory hash table technique described in
[DeWi84, Brat84]. The type 1 algorithms HSM1 and HH1 transmit both R and S over
the network; the type 2 algorithms HSM2 and HH2 transmit only R, the smaller relation.

6.2.2.1. Hash Based Sort-Merge Join Type 1 (HSM1)

This algorithm partitions each batch of R and S further into NC·NP_join buckets
that are joined in parallel by NP_join processes on each of the NC clusters. A hash function
applied to the join attribute of each tuple determines the batch, cluster, and process
in which it will be joined. Each pair of buckets is joined using a sort-merge join.

Let RB_ij denote the subset of RB_j derived from R_i, the subset of R stored at C_i.
Define SB_ij similarly.

The  algorithm  has  three  phases. 

Phase 1: At each cluster C_i, NP_join processes read R_i from disk a page at a time.
The assignment of pages to processes is arbitrary. Each process hashes each tuple in a
page to determine the batch, cluster, and process in which it will be joined. If the tuple
is in batch j ≠ 0, it is placed in a buffer to be written to a disk file for RB_ij. If the tuple
belongs to batch 0 but is to be joined on a different cluster, it is placed in a buffer to be
sent to that cluster. If the tuple is to be joined by a different process on the same cluster,
it is placed in a buffer to be sent to the correct process. When a process fills one of
these buffers, it writes it to disk or sends it to another cluster or process as appropriate.
When a page is sent to another cluster, it is received by an arbitrary process on that
cluster; the tuples in the page are rehashed and sent to another process in the cluster if
necessary. Once the tuple arrives at the correct process, it is inserted into a binary
search tree. The search tree will be traversed inorder in the next phase to produce a
sorted version of the bucket.

When all of R_i has been read, S_i is processed in the same way.

Phase  2:  This  phase  is  repeated  for  j  ranging  from  1  to  NBATCH.  If  NBATCH  =  0, 
this  phase  is  omitted. 

At each cluster C_i, each process performs an inorder traversal of its R and S search
trees to produce a sorted stream of tuples for each bucket. It merge-joins these tuples
to produce the join results. As tuples are consumed, the space they occupied is freed.
As the memory is freed, the processes read the file for batch RB_ij, hash each tuple, send
the tuples to the appropriate cluster and process, and insert them into search trees occupying
the newly freed space. When all of RB_ij has been read, SB_ij is processed in the
same way.

Phase  3:  In  this  phase,  the  buckets  in  batch  NBATCH  are  joined  as  described  for 
phase  2.  No  data  remains  to  be  read  from  disk  and  partitioned. 
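The routing decision made for each tuple in phase 1 can be sketched as below. The way a single hash value is split into batch, cluster, and process is an assumption made for illustration (the report does not specify it), as are the function names and the use of Python's built-in `hash`; indices are 0-based.

```python
def route(key, nbatch, nc, np_join):
    """Map a join key to (batch, cluster, process) by splitting one hash value."""
    h = hash(key)
    h, process = divmod(h, np_join)  # which join process on the cluster
    h, cluster = divmod(h, nc)       # which cluster
    batch = h % (nbatch + 1)         # which batch, 0..nbatch
    return batch, cluster, process


def destination(key, here_cluster, here_process, nbatch, nc, np_join):
    """Where a process puts a tuple it has just read during phase 1."""
    batch, cluster, process = route(key, nbatch, nc, np_join)
    if batch != 0:
        return "disk"            # buffered for the local file of a later batch
    if cluster != here_cluster:
        return "remote cluster"  # buffered to be sent over the network
    if process != here_process:
        return "local process"   # buffered for another process on this cluster
    return "search tree"         # already at the joining process
```

Using independent parts of the hash value for the three decisions keeps the buckets within a batch evenly spread over clusters and processes.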

6.2.2.2. Hash Based Sort-Merge Join Type 2 (HSM2)

This algorithm differs from algorithm HSM1 in that relation S is not sent over the
network. Instead, each cluster C_i joins all of R with its portion S_i of relation S. A
hash function on the join attribute of each tuple determines the batch to which the
tuple belongs, and the process on each cluster (in the case of R) or the process on cluster
C_i (in the case of S_i) that will do the joining.

The algorithm has three phases. Only the first phase will be described; the other
phases should be clear from the description of phase 1 and Algorithm HSM1.

Phase 1: At each cluster C_i, NP_hash processes read R_i. They hash each tuple to determine
the batch and process number. Each tuple is buffered either to be written to disk
or to be broadcast to the correct process on each cluster. Each process constructs a
binary search tree of tuples belonging to its own bucket. When all of R_i has been read,
S_i is processed in the same way, except that tuples from S_i are not transmitted to other
clusters, only to other processes on the same cluster.

6.2.2.3. Multiprocessor Hybrid Hash Join Type 1 (HH1)

Algorithm HH1 partitions R and S into batches and buckets in the same way as
Algorithm HSM1. However, it uses a hash-based algorithm to join each pair of buckets.
An in-memory hash table is constructed for each bucket of R. This table is then
probed using tuples from the corresponding bucket of S to produce the join results.
The concurrent processing of consecutive batches performed in phase 2 of Algorithm
HSM1 is not possible here because the hash table for a bucket of R cannot be deallocated
until it has been probed by all the tuples in the corresponding S bucket. On the
other hand, Algorithm HSM1 requires memory to hold S buckets, while this algorithm
does not.

The  algorithm  has  four  phases. 

Phase 1: At each cluster C_i, NP_join processes read R_i from disk. They hash each
tuple and copy it to the appropriate buffer if it must be written back to disk or sent to
another cluster or process, as in Algorithm HSM1. Each process constructs a hash table
of tuples belonging to its own bucket.

Phase 2: At each cluster C_i, the NP_join processes read S_i from disk. They hash each
tuple and buffer it to be written to disk or sent to another cluster if necessary. It is not
necessary to send a tuple from one process to another on the same cluster, however.
Once a tuple is at the correct cluster, any process can probe the appropriate hash table
to generate the join results.

Phases  3  and  4  are  repeated  for  j  ranging  from  1  to  NBATCH.  If  NBATCH  =  0, 
these  phases  are  omitted. 

Phase 3: This is similar to phase 1 except that each cluster C_i reads RB_ij from disk
instead of R_i, and performs no disk writes.

Phase 4: This is similar to phase 2 except that each cluster C_i reads SB_ij from disk
instead of S_i, and performs no disk writes.
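The build and probe steps applied to one pair of buckets can be sketched as follows. This is a minimal single-process illustration; the function names and the `(key, payload)` tuple layout are assumptions made for the example.

```python
from collections import defaultdict


def build(r_bucket):
    """Phases 1 and 3: construct an in-memory hash table for a bucket of R."""
    table = defaultdict(list)
    for key, payload in r_bucket:
        table[key].append(payload)
    return table


def probe(table, s_bucket):
    """Phases 2 and 4: probe the table with each tuple of the S bucket."""
    for key, s_payload in s_bucket:
        # .get avoids inserting empty entries for keys absent from R.
        for r_payload in table.get(key, ()):
            yield (key, r_payload, s_payload)
```

Since probing only reads the table, tuples of S can be streamed through it without being stored, which is why this approach needs no memory for S buckets; the table itself must stay allocated until the last S tuple of the bucket has been probed.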

6.2.2.4. Multiprocessor Hybrid Hash Join Type 2 (HH2)

This algorithm is to Algorithm HH1 as Algorithm HSM2 is to Algorithm HSM1. It
has four phases similar to those of Algorithm HH1. However, tuples of S are not
transmitted over the network. In fact, they are not even transmitted between processes
on the same cluster since any process can probe a hash table on the same cluster.

6.2.3. Discussion

The algorithms described above represent the latest versions in a sequence of algorithms.
These versions provide better overlap in the usage of different resources than
earlier versions. For example, in all four hash partitioning algorithms, the communications
load is spread as evenly as possible over the duration of the algorithm execution.
Tuples are sent across the network only when they are about to participate in a join.
One of our earlier versions transmitted all tuples to the joining cluster when the relations
were being partitioned, as in DeWitt and Gerber's algorithm [DeWi85]. We found
that this could cause a network bottleneck during partitioning: the disks and CPUs
were not well utilized. Spreading the communications load over the duration of the
algorithms reduced their execution time.

The overlapping among disk I/O, CPU processing, and data transfer over the network
gives algorithm designers opportunities to tune the algorithms to obtain the
tradeoff between elapsed time, total processing cost, and memory usage
that is best for their system. Another example of this is the way the hash-based
algorithms store tuples in batches 1, ..., NBATCH. One possibility is to use one
file per batch at each cluster; we chose to use one file per remote cluster at each cluster.
In the former case, no repartitioning is needed in the later phases, but more buffer pages
have to be allocated in the partitioning phase. In the latter case, the tuples in the same
batch have to be rehashed to determine their bucket numbers, but far fewer buffer pages
are needed to hold the tuples rewritten to disk.

6.3.  Performance  Comparisons 

This  section  presents  performance  comparisons  of  the  six  join  algorithms  described 
in  Section  6.2.  These  comparisons  are  based  on  the  formulas  obtained  from  the 
analysis. 

The main purpose of the performance comparison is to gain some insight into the
behavior of the different algorithms. The novelty of this performance analysis lies in two
facts. First, there are few comprehensive performance studies of join algorithms in the
multiprocessor, multi-disk environment [DeWi85]. Second, most performance studies use
total processing time as the metric. In fact, disk I/O operations, data transfer along the
network, and CPU processing are often overlapped. The extent of the overlap varies
among different algorithms, which leads to different elapsed times even with the same
total processing cost. One of the goals of parallel join algorithm design should be to
overlap the usage of various system resources as much as possible while holding total
resource usage constant.

The tests conducted can be categorized into three groups that investigate (1) the
effects of communication speed; (2) the effects of system configurations; and (3) the
effects of data sizes. In this section, we first describe the methodology used in the
analysis. Then the details of the tests, including the parameter settings and test results,
are discussed.


6.3.1.  Analysis  Methodology 

For each algorithm, we will compute the following quantities:

T        elapsed time

T_cpu    total CPU time

T_disk   total disk transfer time

T_net    total network transfer time

Each of these is the sum over all phases i of the corresponding per-phase quantities T^i,
T^i_cpu, T^i_disk, and T^i_net. Resource utilization percentages are easily derived from these
basic measures. The analysis uses the following times for basic operations:

t_comp         CPU time to compare two attributes

t_hash         CPU time to compute hash function of a key

t_move         CPU time to move a tuple in memory

t_swap         CPU time to swap two pointers in memory

t_build-tuple  CPU time to build a join result tuple

t_send         CPU time to send a page over network

t_recv         CPU time to receive a page over network

t_net          network hardware page transfer time

t_disk         disk page transfer time

We first compute the following basic quantities for each phase i:

P^i_disk   the number of disk transfer pages

P^i_send   the number of pages sent over the network

P^i_recv   the number of pages received over the network

T^i_join   CPU time unrelated to disk or net transfers

P^i_recv can be greater than P^i_send due to broadcasting. Then,

$$T^i_{cpu} = T^i_{join} + P^i_{send} \cdot t_{send} + P^i_{recv} \cdot t_{recv}$$

$$T^i_{disk} = P^i_{disk} \cdot t_{disk}$$

$$T^i_{net} = P^i_{send} \cdot t_{net}$$

We  assume  that  the  CPU  time  attributable  to  disk  transfers  is  negligible.  For  network 
communication,  we  consider  both  the  CPU  time  and  the  hardware  transfer  time;  either 
is  a  potential  bottleneck. 
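The three per-phase resource totals translate directly into code. The sketch below is illustrative: the class and field names are assumptions that mirror the symbols in the text, and any numeric values supplied to it would be made up rather than taken from the report.

```python
from dataclasses import dataclass


@dataclass
class PhaseCounts:
    p_disk: float  # pages transferred to or from disk in the phase
    p_send: float  # pages sent over the network
    p_recv: float  # pages received (may exceed p_send under broadcast)
    t_join: float  # CPU time unrelated to disk or network transfers


@dataclass
class Timings:
    t_send: float  # CPU time to send a page
    t_recv: float  # CPU time to receive a page
    t_net: float   # network hardware time to transfer a page
    t_disk: float  # disk time to transfer a page


def phase_resource_times(c: PhaseCounts, t: Timings):
    """Return (T_cpu, T_disk, T_net) for one phase."""
    t_cpu = c.t_join + c.p_send * t.t_send + c.p_recv * t.t_recv
    t_disk = c.p_disk * t.t_disk  # CPU cost of disk transfers is neglected
    t_net = c.p_send * t.t_net    # a broadcast page occupies the bus once
    return t_cpu, t_disk, t_net
```

Network communication appears twice, once as CPU time and once as hardware transfer time, because either can become the bottleneck.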

The elapsed time T^i for phase i will in general be significantly less than
T^i_cpu + T^i_disk + T^i_net due to overlap. It is computed as the maximum of:

•  The total disk transfer time of any disk. If I/O is spread evenly over all disks, this
quantity is T^i_disk/(NC·ND).

•  The total network transfer time T^i_net for the phase.

•  The total CPU time for any single process, including network send and receive
processes.

•  The total CPU time for any cluster, divided by NP, the number of processors per
cluster. If processing is spread evenly over all clusters, this quantity is
T^i_cpu/(NC·NP). This quantity models processor sharing among the processes at a
cluster.
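Since a pipelined phase cannot finish before its busiest resource does, the estimate can be sketched as the largest of the four quantities listed. The code below assumes that I/O and processing are spread evenly over disks and clusters; the function name, parameter names, and bottleneck labels are assumptions made for the example.

```python
def phase_elapsed(t_cpu, t_disk, t_net, nc, np_per_cluster, nd,
                  max_process_cpu=0.0):
    """Estimate the elapsed time of one phase and name its bottleneck.

    t_cpu, t_disk, t_net are phase totals; max_process_cpu is the CPU time
    of the busiest single process (0.0 if it is not modelled).
    """
    candidates = {
        "disk": t_disk / (nc * nd),                    # busiest disk
        "network": t_net,                              # single shared bus
        "single process": max_process_cpu,             # a process cannot be split
        "cluster CPU": t_cpu / (nc * np_per_cluster),  # processor sharing
    }
    bottleneck = max(candidates, key=candidates.get)
    return candidates[bottleneck], bottleneck
```

Reporting the bottleneck alongside the time makes it easy to see which architectural parameter (more disks, more processors, or a faster network) would shorten a given phase.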

The  rationale  for  this  approximation  of  elapsed  time  is  as  follows.  In  each  phase, 
the processes execute in pipelined fashion. The time for data to flow from the beginning
of a pipeline to the end is assumed to be negligible compared to the total elapsed
time, so the pipeline is in steady state for most of the phase. Sufficient buffering is provided
to permit all processes to execute in parallel with each other and with disk and
network  transfers: 

•  One  buffer  is  allocated  for  each  input  and  output  stream  for  each  process. 

•  One  buffer  is  allocated  for  each  disk  to  contain  the  data  being  read  or  written. 

•  One  buffer  is  allocated  for  the  network  send  process  at  each  cluster  to  hold  the 
next  page  to  be  transmitted,  and  one  buffer  for  the  network  receive  process  to  hold 
the  next  incoming  page. 

The  detailed  formulas  derived  in  the  analysis  are  listed  in  the  next  section. 


6.3.1.1. Analysis of Join Algorithms in Section 6.2

In this section, we list the detailed formulas used in the analysis of the join algorithms
described in Section 6.2.

6.3.1.1.1. Elapsed Time of Type 1 Sort-Merge Join Algorithm, SMJ1

The parameters that determine the performance of the algorithm include the architecture,
timing, and workload parameters listed earlier, and algorithm-specific parameters:
NFRUN_R, the number of final runs of R; NFRUN_S, the number of final runs of
S; and NP_heap, the number of heap processes per cluster.

There are three major phases in this algorithm: (1) Generate runs of R, (2) Generate
runs of S, and (3) Merge R-runs, merge S-runs, and merge-join. The total execution
time for the algorithm is therefore T_SMJ1 = T_1 + T_2 + T_3.

T_1 and T_2: Execution Time of Phases 1 and 2. The memory requirements
per cluster for this phase are as follows: ND + NP_heap input buffers, the same number of
output buffers, and NP_heap heap areas of the size required to fill the remainder of the M
bytes dedicated to the algorithm. Let this size be H bytes, and let NHREC_R and NHPAG be
the number of R records and pages that it can hold. We have

$$H = \frac{M - 2 \cdot PG \cdot (ND + NP_{heap})}{NP_{heap}}$$

There are three processing stages in phase 1: IRH (read R from disk until heap
filled), IDRH (read R from disk and write runs to disk until all of R is read from disk),
and DRH (empty heap onto disk, completing R-runs). The heap processing time of
these stages is calculated as follows:


t_IRH  = TPR · (t_move + ...

t_IDRH = TPR · (...

t_DRH  = TPR · (t_move + ...

The execution time of this phase is:

The calculation of the execution time of phase 2, T_2, is the same as for phase 1, substituting
S for R.

T_3: Execution Time of Phase 3. There are five pipelined processing stages:
the first and second merge stages for R and S, and the merge-join, denoted as M1R,
M2R, M1S, M2S, and MJ. Stages M1R, M1S, and MJ are performed at each cluster;
stage M2R is performed at NFRUN_R clusters, and stage M2S is performed at NFRUN_S
clusters. Since the runs produced in phase 1 contain an average of 2·NHPAG pages, the
merge factor for stage M1R is

$$MF1_R = \frac{|R|}{2 \cdot NHPAG \cdot NC}$$

The merge factor MF1_S for stage M1S is similar. The merge factor for stage M2R is
NFRUN_S. The merge factor for stage M2S is NFRUN_R.
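The merge factor and the per-page merge cost that depends on it can be computed as in the sketch below. The cost expression follows the pattern of the per-page execution times given for the merge stages in this section, TP·(t_move + (2·t_comp + t_swap)·lg(merge factor)); the function and parameter names are assumptions, and `tuples_per_page` is assumed to be what TPR and TPS denote.

```python
import math


def merge_factor(rel_pages, nhpag, nc):
    """Initial runs per cluster: |R| / (2 * NHPAG * NC).

    Each cluster holds rel_pages / nc pages, and priority-queue run
    generation yields runs averaging 2 * nhpag pages.
    """
    return rel_pages / (2 * nhpag * nc)


def merge_time_per_page(tuples_per_page, fan_in, t_move, t_comp, t_swap):
    """CPU time to merge one page of tuples through a fan_in-way merge."""
    per_tuple = t_move + (2 * t_comp + t_swap) * math.log2(fan_in)
    return tuples_per_page * per_tuple
```

The logarithmic term reflects a selection structure of depth lg(fan-in), so doubling the number of runs adds only one more compare-and-swap step per tuple.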

Each cluster requires two buffers per input stream and two buffers per output
stream for stages M1R, M1S, and MJ. Assuming that merging and merge-joining have
no other memory requirements, this gives a memory requirement per cluster (in bytes)
of 2·(MF1_R + MF1_S + 4)·PG. The NFRUN_R clusters that perform M2R require an additional
2·(NFRUN_R + 1)·PG bytes, and the NFRUN_S clusters that perform M2S require
an additional 2·(NFRUN_S + 1)·PG bytes.

The  per-page  execution  times  for  the  stages  are: 

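$$t_{M1R} = TPR \cdot (t_{move} + (2 \cdot t_{comp} + t_{swap}) \cdot \lg(MF1_R))$$

$$t_{M2R} = TPR \cdot (t_{move} + (2 \cdot t_{comp} + t_{swap}) \cdot \lg(NFRUN_R))$$

$$t_{M1S} = TPS \cdot (t_{move} + (2 \cdot t_{comp} + t_{swap}) \cdot \lg(MF1_S))$$

$$t_{M2S} = TPS \cdot (t_{move} + (2 \cdot t_{comp} + t_{swap}) \cdot \lg(NFRUN_S))$$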

The network sends and receives performed by the pivot and non-pivot clusters are
summarized in the following table:

The total number of pages transmitted over the network is less than the total number of pages
received because the pivot clusters use a one-to-many broadcast to send their results to
the merge-join processes.

The total execution time for phase 3 is therefore

6.3.1.1.2. Elapsed Time of Type 2 Sort-Merge Join Algorithm, SMJ2

There are also three major phases in algorithm SMJ2: (1) Generate initial runs of
R and store on disk, (2) Merge R-runs into a complete sorted relation and broadcast it to all
clusters, and (3) Generate S-runs and merge-join with R.


Two  algorithm-specific  parameters  are:  ArPfteap,  the  number  of  heap  processes  used 
in  phase  1  to  generate  the  initial  runs  of  R ,  and  NP  ■ ,  the  number  of  heap  and  merge- 
join  process  pairs  in  phase  3. 

T1, the execution time of phase 1, is identical to that of phase 1 for the type
1 sort-merge join algorithm.

T2: Execution Time of Phase 2. There are two processing stages: the first
and second merge stages, denoted M1R and M2R. M1R is performed at all clusters,
and M2R is performed at C1. Clusters other than C1 require a process to receive R
from the network and write it to disk. The merge factor for M1R is (as before)

\[ MF1_R = \frac{|R|}{2 \cdot NHPAG \cdot NC} \]

and the merge factor for M2R is NC.

Inputs and outputs are double buffered. Clusters other than C1 execute M1R and
the net-to-disk copy, so they require 2·(MF1_R + 3)·PG bytes. Cluster C1 requires an
additional 2·(NC + 1)·PG bytes for stage M2R. The per-page execution times are:

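\[
\begin{aligned}
t_{M1R} &= TP_R \cdot \bigl(t_{move} + (2\,t_{comp} + t_{swap}) \cdot \lg(MF1_R)\bigr) \\
t_{M2R} &= TP_R \cdot \bigl(t_{move} + (2\,t_{comp} + t_{swap}) \cdot \lg(NC)\bigr)
\end{aligned}
\]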

At each cluster, M1R produces |R|/NC pages. M2R produces |R| pages at C1.
Each cluster reads |R|/NC pages and writes |R| pages. A total of (2 - 1/NC)·|R|
pages are sent over the network. The total execution time for this phase is therefore

llf?i  1 


T,  =  max((l  + 


R  I  •( 
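The page counts quoted for phase 2 can be checked with a few lines of Python. This is only a sketch: the function name and the example sizes are hypothetical, and it assumes that the runs held by C1 itself never cross the network and that the one-to-many broadcast of the merged relation is counted once.

```python
from fractions import Fraction

def smj2_phase2_pages(r_pages, nc):
    """Page counts for phase 2 of SMJ2 as described in the prose (a sketch)."""
    per_cluster = Fraction(r_pages, nc)      # output of M1R at each cluster
    to_c1 = (nc - 1) * per_cluster           # runs shipped to C1 for stage M2R
    broadcast = Fraction(r_pages)            # assumed: broadcast counted once
    return {
        "read_per_cluster": per_cluster,
        "written_per_cluster": Fraction(r_pages),
        "network_total": to_c1 + broadcast,
    }

# Hypothetical sizes: |R| = 1024 pages spread over NC = 8 clusters.
counts = smj2_phase2_pages(r_pages=1024, nc=8)
print(counts["network_total"])
```

Under these assumptions the network total is (NC - 1)·|R|/NC + |R|, which is the (2 - 1/NC)·|R| figure given in the text.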


T3: Execution Time of Phase 3. The heap and merge-join stages of this
phase are denoted IDSH and MJ respectively. Assuming NP_mj heap processes and
merge-join processes at each cluster, the per-cluster memory requirements are as follows:
(ND + NP_mj)·PG bytes for input buffers to IDSH, 2·NP_mj·PG bytes for buffers between
IDSH and MJ processes, a like amount for input buffers to MJ for relation R, and
NP_mj·H_S bytes for the heaps, where H_S is defined so that available memory is fully
utilized. H_S and the number of records and pages of S that can fit in each heap are,
respectively:


\[
H_S = \frac{M - ND \cdot PG}{NP_{mj}} - 5 \cdot PG, \qquad
NHREC_S = \frac{H_S}{F \cdot ts(S)}, \qquad
NHPAG_S = \frac{H_S}{F \cdot PG}
\]
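The heap size follows from the memory budget listed above: whatever remains of M after the three groups of buffers is split evenly over the NP_mj heaps. The sketch below works through that arithmetic. The function name and the example values are hypothetical, and the two capacity formulas (dividing H_S by F·ts(S) and by F·PG) are our reading of the damaged equations rather than a confirmed transcription.

```python
def smj2_heap_sizes(m, nd, np_mj, pg, f, ts_s):
    """Return (H_S, NHREC_S, NHPAG_S) for one cluster; all sizes in bytes."""
    buffers = (nd + np_mj) * pg       # input buffers to IDSH
    buffers += 2 * np_mj * pg         # buffers between IDSH and MJ
    buffers += 2 * np_mj * pg         # input buffers to MJ for relation R
    h_s = (m - buffers) / np_mj       # remaining memory, split over the heaps
    if h_s <= 0:
        raise ValueError("memory too small for the phase 3 buffers")
    nhrec_s = int(h_s // (f * ts_s))  # whole records per heap (assumed formula)
    nhpag_s = h_s / (f * pg)          # pages per heap (assumed formula)
    return h_s, nhrec_s, nhpag_s

# Hypothetical cluster: 32K pages, 512 pages of memory, 4 disks, 4 process pairs.
PG = 32 * 1024
h_s, nhrec_s, nhpag_s = smj2_heap_sizes(
    m=512 * PG, nd=4, np_mj=4, pg=PG, f=1.2, ts_s=200
)
print(h_s / PG, nhrec_s, round(nhpag_s, 1))
```

Collecting terms gives H_S = (M - ND·PG)/NP_mj - 5·PG, so the closed form and the step-by-step budget agree.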


The following analysis is simplified by assuming that the I/O rate is constant over
the phase. In fact, R is read only after the heaps are filled.
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There  is  no  network  I/O.  The  per-page  processing  times  for  the  two  stages  are: 

\[ t_{IDSH} = TP_S \cdot \bigl(2\,t_{move} + (2\,t_{comp} + t_{swap}) \cdot \lg(NHREC_S)\bigr) \]
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ISI  IS 

T3  =  max((  +  - 


NC  2  ■  NP  ■  NHP  A  G  ?  •  NC 

mj  j 


NP  ■  NC 

mj 


t-omp  ^butld-  tuple} 


1  R  1  )'W*  tjDSH’ 

NP  NC 

mj 




6.3.1.1.3.  Analysis of Algorithms

In the following sections we list the formulas used in evaluating the hash-based join
algorithms. They give the number of disk I/Os, the number of pages transferred over the
network, and the CPU processing time for each phase. Because phases 3 and 4 of the
hybrid hash algorithms, and the partitioning and joining phases for batches 1 to NBATCH,
are the same for every batch, the quantities listed in the formulas are the totals over all
these batches. The elapsed time is computed from the number of disk I/Os, the number of
pages transferred, and the CPU time, using the method described in Section 6.3.1.

The Number of Batches and Batch Sizes. Before giving the formulas, we first
briefly describe the computation of the number of batches and the data sizes of the
batches: NBATCH, |RB_0|, ||RB_0||, |SB_0|, ||SB_0||, etc. These quantities are determined
by the amount of memory left in each cluster after allocating buffers for disk and
network I/O and interprocess communication according to the principles described in
the above section. They are computed as follows:

Let NBF_0 be the number of buffer pages required during partitioning and NBF_1 be
the number of buffer pages required at other times. Let us first compute NBF_1,
which is algorithm-specific. For the first type of hash-based algorithms, HSM1 and
HH1, where both relations are transferred among clusters, each join processor has
three output streams after reading in and hashing tuples: tuples that it uses itself to
probe its own hash table, tuples that other processors at the same
cluster use to probe the corresponding hash tables, and tuples to send to other clusters.
That is, three kinds of buffer pages have to be allocated for each processor. The
total number of buffer pages is therefore

\[ NBF_1 = NC \cdot \bigl(ND + 2 + NP_{join} \cdot (NC + NP_{join} - 1)\bigr) \]

For the other two algorithms, HSM2 and HH2, where the small relation is replicated
at every cluster, all tuples arriving at a join processor are processed either by
the processor itself or by other processors at the same cluster, so fewer buffer pages
are needed.
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For  HSM2  and  HH2, 

\[ NBF_1 = NC \cdot \bigl(ND + 2 + NP_{join} \cdot (NP_{join} + 1)\bigr) \]

In the partition phase, each processor also generates a stream of tuples that are
written back to disk for later processing. NBF_0 is therefore related to NBF_1 as
follows:

\[ NBF_0 = NBF_1 + NBATCH \cdot NC \cdot NP_{join} \]
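The buffer counts above are simple enough to tabulate directly. The following sketch transcribes the three formulas as we read them from the damaged source; the function names and the example configuration are hypothetical.

```python
def nbf1_partitioned(nc, nd, np_join):
    """Buffer pages outside partitioning for HSM1 and HH1 (both relations move)."""
    # Each join processor feeds itself, its NP_join - 1 neighbours in the
    # cluster, and the NC - 1 other clusters.
    return nc * (nd + 2 + np_join * (nc + np_join - 1))

def nbf1_replicated(nc, nd, np_join):
    """Buffer pages outside partitioning for HSM2 and HH2 (R is replicated)."""
    return nc * (nd + 2 + np_join * (np_join + 1))

def nbf0(nbf1, nbatch, nc, np_join):
    """Buffer pages during partitioning: one spill stream per batch and processor."""
    return nbf1 + nbatch * nc * np_join

# Hypothetical configuration: 8 clusters, 4 disks and 4 join processors each.
p = nbf1_partitioned(nc=8, nd=4, np_join=4)
r = nbf1_replicated(nc=8, nd=4, np_join=4)
print(p, r, nbf0(p, nbatch=3, nc=8, np_join=4))
```

Note that the partitioned variant grows with NC·NP_join per cluster, while the replicated variant depends only on the number of processors inside a cluster.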

The size of the batches can be computed as follows. Let NDP be the total number
of data pages staged in main memory during join processing for sorting (HSM1 and
HSM2) or building hash tables (HH1 and HH2), excluding buffers. For algorithms HSM1
and HSM2, NDP = |R| + |S|, since these algorithms stage batches of both relations
prior to joining them. For algorithms HH1 and HH2, NDP = |R|. Let NDP_0 be the
number of data pages staged to join batch 0 and NDP_1 be the number of data pages
staged for each subsequent batch. Clearly,

\[ NDP = NDP_0 + NBATCH \cdot NDP_1 \]

We want to maximize NDP_0 to minimize the amount of intermediate data written to
disk. We therefore assign all available memory to stage batch 0:

\[ NBF_0 + F \cdot NDP_0 \le NC \cdot M \]

with the equality holding unless NBATCH = 0. Here, F represents the "universal fudge
factor", a number slightly greater than one that accounts for the memory overhead
required for a hash table or binary search tree of tuples, beyond the memory required
for the tuples themselves. To maximize NDP_0, we must minimize NBF_0, and therefore
NBATCH, subject to the constraint

\[ NBF_1 + F \cdot NDP_1 \le NC \cdot M \]


It is relatively easy to derive the following:

\[ NBATCH = \max\left(0,\ \frac{F \cdot NDP + NBF_1 - NC \cdot M}{NC \cdot M - NBF_1 - NC \cdot NP_{join}}\right) \]

With NBATCH and the above equations, we can then calculate the data sizes |RB_0|,
|SB_0|, etc.
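The derivation can be retraced by substituting NDP_0 = (NC·M - NBF_0)/F and NDP_1 <= (NC·M - NBF_1)/F into NDP = NDP_0 + NBATCH·NDP_1 and solving for NBATCH. The sketch below does the same numerically. It makes two assumptions that the text does not state: M is expressed in pages, and NBATCH is rounded up to the next integer. The function name and the example values are hypothetical.

```python
import math

def plan_batches(ndp, nbf1, nc, m_pages, np_join, f):
    """Return (NBATCH, NDP_0, NDP_1); all sizes are in pages."""
    total = nc * m_pages                    # memory of all clusters together
    per_batch = nc * np_join                # buffer pages each extra batch costs
    denom = total - nbf1 - per_batch
    if denom <= 0:
        raise ValueError("not enough memory to stage any batch")
    # Assumption: the batch count is rounded up to a whole number.
    nbatch = max(0, math.ceil((f * ndp + nbf1 - total) / denom))
    if nbatch == 0:
        return 0, float(ndp), 0.0           # everything fits in batch 0
    nbf0 = nbf1 + nbatch * per_batch
    ndp0 = (total - nbf0) / f               # batch 0 takes all remaining memory
    ndp1 = (ndp - ndp0) / nbatch            # the rest is split evenly
    return nbatch, ndp0, ndp1

# Hypothetical case: 8 clusters of 512 pages, NBF_1 = 400, F = 1.2, NDP = 10000.
nbatch, ndp0, ndp1 = plan_batches(
    ndp=10000, nbf1=400, nc=8, m_pages=512, np_join=4, f=1.2
)
print(nbatch, round(ndp0), round(ndp1))
```

Rounding up guarantees that each later batch still satisfies NBF_1 + F·NDP_1 <= NC·M, at the cost of leaving a little memory unused in those batches.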


6.3.1.1.4.  Analysis of Algorithm HSM1
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6. 3. 1.1. 5.  Analysis  of  Algorithm  HSM2 
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6. 3. 1.1. 6.  Analysis  of  Algorithm  HHl 
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6.3. 1.1. 7.  Analysis  of  Algorithm  HH2 
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6.3.2.  Parameter  Settings 

Three  types  of  parameters  are  used  in  the  comparisons:  architectural  parameters, 
timing  parameters,  and  workload  parameters.  The  parameter  values  used  are  listed 
below. 
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6.3.3.  Tests  and  Results 


We now describe the tests conducted in the performance comparisons. Initial
analysis showed that the sort-merge algorithms were generally much slower than the
hash-based algorithms. Therefore we show the results for the hash-based algorithms
only. Figures 6.2-6.10 show the results.

6.3.3.1.  Communication versus Performance

In this test, the bandwidth of the communications line was varied from 100 Mbps
to 600 Mbps to study the effect of the data transfer rate on the performance of the
algorithms. The results are shown in figure 6.2. The elapsed times of the type 1 algorithms
drop dramatically when the bandwidth increases from 100 Mbps to 300 Mbps. This is
because the system is network bound with our typical parameter settings for these
algorithms. In other words, data transfer was the bottleneck, and the bandwidth of the
communications line determined their elapsed times. In contrast, the elapsed times of
the type 2 algorithms did not change at all when the bandwidth was varied over this range.
In these two algorithms, the amount of data transferred equals the size of the small
relation R. It was not the bottleneck when the bandwidth was greater than 200 Mbps.
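The network-bound behaviour can be illustrated with a toy bottleneck model. The sketch assumes that CPU work, disk I/O and network transfer overlap perfectly, so that a phase takes as long as its slowest resource. This mirrors the max(...) form of the elapsed-time formulas in this section but is not the full method of Section 6.3.1. All workload numbers are hypothetical and are not the parameter settings used in the tests.

```python
def elapsed_time(cpu_s, disk_s, bytes_sent, bandwidth_mbps):
    """Elapsed time of one fully overlapped phase: the slowest resource wins."""
    net_s = bytes_sent * 8 / (bandwidth_mbps * 1e6)
    return max(cpu_s, disk_s, net_s)

# Hypothetical workloads (illustrative only).
TYPE1 = dict(cpu_s=40.0, disk_s=60.0, bytes_sent=3.0e9)  # ships both relations
TYPE2 = dict(cpu_s=70.0, disk_s=60.0, bytes_sent=0.5e9)  # ships only relation R

for mbps in (100, 200, 300, 600):
    t1 = elapsed_time(bandwidth_mbps=mbps, **TYPE1)
    t2 = elapsed_time(bandwidth_mbps=mbps, **TYPE2)
    print(f"{mbps:4d} Mbps  type 1: {t1:6.1f} s   type 2: {t2:6.1f} s")
```

With these made-up numbers the type 1 time falls as bandwidth rises until another resource becomes the bottleneck, while the type 2 time stays flat because its network term never dominates.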

6.3.3.2.  System Configuration versus Performance

The first group of tests, which investigated the effects of system configuration on
performance, consists of the following five tests.


•  The total hardware cost, that is, the total number of disks (NC·ND), the total number
of processors (NC·NP), and the total memory size (NC·M), is kept constant. The number
of clusters in the system (NC) is varied.

•  The configuration of each cluster is kept the same (i.e., NP and ND are fixed), and
the number of clusters in the system is varied.

•  The number of disks at each cluster, ND, is varied, and the other parameters are kept
constant.

•  The number of processors at each cluster, NP, is varied, and the other parameters are
kept constant.

•  The memory size M at each cluster is varied, and the other parameters are kept
constant.

We now describe these tests in more detail.

(1)  The elapsed time versus different configurations under the same total hardware cost.
The total number of processors and disks and the total size of the memory
banks in the system were kept constant. The number of clusters was varied,
and the number of disks and processors and the size of the memory bank at each
cluster were varied accordingly to keep the total resources constant. One extreme of
the spectrum is that all resources form a single large cluster. The other extreme is a
system such as Gamma [DeWi86], where each cluster has only one processor, one
disk and one piece of memory. The results are shown in figure 6.3. The second
extreme case is not shown in the figure, since the trend is already apparent when the
system consists of 64 clusters, in which case each cluster has two disks and two
processors. When the system consists of 128 clusters with one processor and one
disk each, the elapsed time for HSM-2 is almost double that of the 64-cluster case.
The other three curves remain flat (not shown in the figure).

From figure 6.3, it can be seen that a huge single cluster provides the best
performance, since the communications cost is eliminated. The curves of HH-2 and
HSM-2 are flat since the system is network-bound. This will be seen more clearly
later. The total amount of data transferred in the type 2 algorithms equals the size
of relation R, which is constant when the system configuration is changed. With
a large number of clusters (> 32), HSM-2 performs poorly, since replicating
relation R increases the total processing cost. The increasing CPU cost makes the
system CPU bound: when the number of clusters is doubled, the elapsed time is also
doubled. For the type 1 algorithms, increasing the number of clusters increases
the quantity

\[ \frac{NC - 1}{NC} \cdot \bigl(\|R\| + \|S\|\bigr) \]

which is the quantity of data transmitted over the network. When the number of
clusters in the system increases from 2 to 4, this amount increases from one half to
three quarters of ||R|| + ||S||. This is reflected in the increase of the elapsed
time. We ignore memory contention and the cost of synchronizing concurrent
access to disks in our analysis. The processing power with regard to disk
I/O and CPU processing does not change in any of the tests, since the total numbers of
disks and processors are kept constant no matter how they are organized into clusters.
The huge single cluster case is just an indication of the lower bound of the
elapsed time. It is impractical to put a large number of disks and processors with
shared memory in one cluster.

Figure 6.3 was obtained with a 100 Mbps network. With 600 Mbps, the performance
is a little different. In this case, the type 1 algorithms performed better
than their type 2 counterparts. Figure 6.4 shows the results. As in Figure 6.3, the type 2
algorithms performed very poorly when the number of clusters exceeded 64.

(2)  The elapsed time versus the number of clusters. In this test, the configuration of
each cluster was kept the same and the number of clusters in the system was
varied. The effect of this variation is to increase the parallel processing power of
the system and also to introduce more data transfer for some algorithms, since we
assume that the original data is scattered around the system. The result of this
test is shown in figure 6.5.

(3)  The elapsed time versus the number of disks at each cluster. In these tests, the
number of disks was varied and the other parameters were kept constant. Figure 6.6
shows the result. It can be seen from the figure that it is unnecessary to attach
more disks to a cluster when the bottleneck is not disk I/O. When the number of
disks was more than 8 in the tested case, increasing the number of disks did not
bring any real performance benefit.

(4)  The elapsed time versus the number of processors at each cluster. In these tests, the
number of processors at each cluster was varied, and the results are shown in figure
6.7. It can be seen from the figure that CPU processing was not the bottleneck
even with 2 processors at each cluster. The only exception was the HSM-2 algorithm,
which needed the most extensive CPU computation among these algorithms. However,
with more than 8 processors per cluster, the elapsed time did not decrease
further when more processors were added to the clusters. Another observation is
that, in our buffer allocation scheme, the number of buffers needed increases in
proportion to the square of the number of processors (not linearly). A large number of
processors may therefore leave insufficient memory for executing the algorithms.

(5)  The elapsed time versus the size of memory bank at each cluster. In this group of
tests, the size of the memory bank at each cluster was varied. From the results,
shown in figure 6.8, it can be seen that the type 2 algorithms required more
memory space for buffers. That is, the minimum memory requirement is
stricter for them. However, as long as the memory was big enough to start the
algorithm, there was not a big difference in the elapsed time with different memory
sizes. This can be explained as follows. The only benefit a large memory
provides is to save the disk I/O and related rehashing of the first batch of buckets.
The processing of the remaining buckets is not affected by the bucket sizes,
which are determined by the memory size. If the processing cost of the first batch is
not the dominant factor of the total processing, or the disk I/O cost is not the
bottleneck in the first phase, the memory size will not affect the elapsed time much,
as seen from the figure.
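Two of the quantitative remarks in tests (1) and (4) are easy to reproduce: the fraction of ||R|| + ||S|| that the type 1 algorithms send over the network, and the roughly quadratic growth of buffer pages with the number of processors. The sketch below uses hypothetical function names and an arbitrary example configuration. The per-cluster buffer count is taken from the type 1 buffer formula of Section 6.3.1.1.3 as we read it.

```python
from fractions import Fraction

def type1_network_fraction(nc):
    """Share of (||R|| + ||S||) that crosses the network with NC clusters."""
    return Fraction(nc - 1, nc)

def buffer_pages_per_cluster(nd, nc, np_join):
    """Buffer pages one cluster needs for HSM1/HH1 (assumed reading of NBF_1 / NC)."""
    return nd + 2 + np_join * (nc + np_join - 1)

for nc in (1, 2, 4, 8, 64):
    print(f"NC={nc:3d}: network share = {type1_network_fraction(nc)}")

# Hypothetical cluster with 4 disks in an 8-cluster system.
for np_join in (2, 4, 8, 16):
    pages = buffer_pages_per_cluster(nd=4, nc=8, np_join=np_join)
    print(f"NP={np_join:2d}: buffer pages per cluster = {pages}")
```

Going from 2 to 4 clusters moves the network share from 1/2 to 3/4, and each doubling of NP more than doubles the buffer pages because of the NP_join squared term.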




6.3.3.3.  Data Sizes versus Performance

The third group of tests studied the effects of the data size on the performance of the
algorithms. The size of relation R ranged from 1.7510 bytes to 2510, and ||S||
ranged from ||R|| to 10·||R||. Figure 6.9 depicts the relationship between the
elapsed time and the relation size. As the sizes of the two relations increased,
the elapsed time of all algorithms increased. However, the type 1 algorithms were more
sensitive to this increase: their elapsed time increased linearly with the size of the
relations. The reason is that the bottleneck in these tests is the network, and the
amount of data transferred increases when the relation sizes increase. Figure 6.10 shows
the same system and relation sizes with a high bandwidth network (600 Mbps). The first
observation is that the type 1 algorithms outperform the type 2 algorithms when the
relations are small. Second, the elapsed time of all algorithms increases to some
extent when the relations become larger. The type 1 algorithms are still more
sensitive to the relation sizes. When the relation sizes become larger, their performance
becomes worse than that of the type 2 algorithms.
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Figure  6.4.  Elapsed  Time  versus  Number  of  Clusters  (Fixed  Hardware  Costs) 
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Figure  6.7.  Elapsed  Time  versus  Number  of  Processors 
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Figure  6.8.  Elapsed  Time  versus  Memory  Size 
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6.4.  Conclusions 

In this chapter, we have described six different parallel and pipelined join
algorithms and presented some results from our analysis to compare them.
The results of our performance evaluation reiterate the relative performance superiority
of the hash-based algorithms compared to the sort-based algorithms. Our results also show
how overlap among the different steps of an algorithm affects its relative
performance. We calculate the bottlenecks in the alternative join algorithms and show
that the performance of an algorithm improves when the tasks are distributed across the
various non-overlapping stages of the algorithm so that maximum overlap and equitable
resource utilization are achieved. Our results show that intercluster communication
bandwidth is typically a bottleneck, and thus an algorithm or system configuration
that reduces intercluster data transfer is preferred. The replicated versions of the
algorithms typically perform better than their non-replicated counterparts because of
the reduced intercluster data transfer.

From this study, we can draw some basic conclusions regarding the parallel
processing of join operations in the multiprocessor environment.

(1)  The different performance shown by the algorithms studied indicates that it is
important to choose an appropriate algorithm for a particular join operation with a
given system configuration. Furthermore, with a given system and relations to be
joined, the query optimizer has to carefully determine the number of clusters, the
number of disks and the number of processors to be used in the join. Generally
speaking, the hash-based algorithms outperform the sort-merge algorithms if
the output tuples are not required in sorted order. However, if
the source relations are already sorted, or the application requires the output tuples
to be sorted on the join attributes, the sort-merge algorithms may be advantageous.
One possibility that has not been mentioned is to use an order-preserving hash function
in the hash-based sort-merge algorithms. The sorted order is then maintained between
different buckets, and the final output tuples can thus be in the desired sorted
order. The use of an order-preserving hash function should not introduce heavy
extra cost.

(2)  In multiprocessor-multidisk systems, high parallelism can be achieved by dividing
the total processing task among processors and disks and executing the subtasks
concurrently. However, in some algorithms, such as the sort-merge algorithms
evaluated in this study, parallel processing becomes difficult for some steps
(the final merge, for example). Increasing the number of processes cannot speed
up such steps. On the other hand, the hash-based algorithms are naturally
parallelizable. Both the partitioning and the joining phase can be executed
concurrently by all participating processors. This is the main reason why
the hash-based algorithms outperform the sort-merge algorithms with regard to
elapsed time.

(3)  Among the three major system resources, CPU, disk and communication network,
the CPU is generally not the bottleneck of the processing pipeline (it is one only in some
steps of the sort-merge joins, as mentioned above). For hash-based algorithms a
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small  number  of  processors  at  each  cluster  is  enough  to  provide  the  necessary  pro¬ 
cessing  power.  On  the  other  hand,  disk  I/O  can  be  the  bottleneck  of  the  pipeline, 
although we intentionally used a large page size (32K) and a very high disk-memory
transfer rate in our study. One possible approach is to increase the number of disks
in each cluster. Such a multi-disk system can efficiently remove the bottleneck
caused by slow disk I/O. However, the number of disks that can be attached to
one cluster is limited by the complexity of control.


For joins with small or moderate size relations, communications cost should not be
a dominant factor in local area networks [Lu85]. There is, however, still the
possibility that the communications line becomes the bottleneck when a large amount of
data has to be transferred through it in a very large system. This
is especially true for the algorithms where a large amount of data transfer is
required (such as HSM1 and HH1).

(4)  One key point in the design of a parallel processing algorithm is to achieve
maximum overlap among operations requiring different resources, in order to increase
the parallelism and reduce the effect of whichever resource is the bottleneck of the
pipeline. For example, in the hash-based algorithms, the remotely processed tuples
can be transferred either during partitioning or right before their use in the joining
phase. The total communication cost is the same in these two schemes, but their
overlap with disk I/O is different. In the first scheme, all communication
occurs while the relations are partitioned. The second scheme distributes the
communications cost; each relatively small amount of data transfer overlaps with disk
I/O and CPU processing in the joining phases. Which scheme is better will depend on
the relative speed of disk I/O and data transfer over the network. This example
reminds us that parallelism between different types of resources can be further
increased by tuning the processing steps carefully for each algorithm. Furthermore,
the precise analysis of such parallel processing algorithms is very
difficult. Some simulation or tests in real systems would be useful.

Since the system configuration, that is, the number of clusters, the number of
processors, the number of disks, and the size of memory used in a join operation, affects
the performance along with the relation sizes and selectivities, query optimization in this
multiprocessor environment could be more complicated, and also more important. It
might be useful to investigate more thoroughly the relative behavior of the
different algorithms with regard to these parameters and to derive some heuristics that can
be used in query processing for such a data flow database machine.
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CHAPTER  7 

Fault  Tolerance  in  Very  Large  Data  Base  Systems 


A system for very large databases will require a high degree of parallelism and,
therefore, will involve a large number of components. In such a system, many
components can fail, affecting the performance of active components and the availability of
the system. Designing such a system with a desired degree of fault tolerance is difficult
and requires qualitative and quantitative considerations of factors affecting system fault
tolerance.

In this investigation, we study the effect of fault tolerance techniques and system
design on system availability. Specifically, we attempt to answer the following
questions: What are the main parameters that affect fault tolerance of a very large database
system? How do you evaluate their effect on fault tolerance? How important are
various fault tolerance techniques? What are the trade-offs that should be considered when
designing a very large database system with a desired degree of fault tolerance? A
generic multiprocessor architecture is used that can be configured in different ways to study
the effect of system architectures. Important parameters studied are different system
architectures and hardware fault tolerance techniques, mean time to failure of basic
components, database size and distribution, interconnect capacity, etc. Quantitative
analysis compares the relative effect of different parameter values. Results show that
the effect of different parameter values on system availability can be very significant.
System architecture, use of hardware fault tolerance (particularly mirroring) and data
storage methods emerge as very important parameters under the control of a system
designer.

7.1.  Introduction 

A very large database is usually heavily used, and many users and applications
depend on it. Downtime or unavailability of such a system is expensive and affects
critical applications dependent on it. A system can become unavailable because of
faults in one or more of its components. To increase its availability, the system must
tolerate component faults. It is also desirable to contain or tolerate faults, because
recovery after a failure that affects the data can be very costly if the database has to be
reconstructed.
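How likely it is that some component is faulty depends strongly on how many components there are. As a rough illustration, suppose each of n components fails independently with the same probability p over some interval. This is an assumption made only for the sketch, not a model taken from this report, and the function name is hypothetical. The chance of at least one failure is then 1 - (1 - p)^n.

```python
def prob_any_failure(p, n):
    """Probability that at least one of n independent components fails."""
    return 1 - (1 - p) ** n

# Hypothetical per-component failure probability of 0.1% over the interval.
for n in (1, 10, 100, 1000):
    print(f"{n:5d} components: {prob_any_failure(0.001, n):.4f}")
```

Even with very reliable parts, a system with a thousand of them is more likely than not to see a fault under these assumptions, which is why tolerating component faults matters at this scale.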

A system that gives good performance and manages a very large database has
many components and a high degree of parallelism. In such a system, even though any
individual component may be fairly reliable, the probability that at least one of the
components will fail is much higher than the probability that a given individual component
will fail. Since many components need to cooperate in a parallel processing system,
achieving fault tolerance is more difficult. However, because the system is large and there is
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more  redundancy,  there  are  more  opportunities  to  achieve  fault  tolerance. 

The topic of fault tolerance in computers and database systems has received a fair
degree of attention. Articles by Kim [Kim84] and Siewiorek [Siew84] discuss
architectures of fault-tolerant computers and fault tolerance techniques used in many
commercial and prototype systems. Some of the publications that discuss fault tolerance
techniques used in commercial and prototype computers are
[Siew78, Borr81, Kast83, Borr84, Bern85] and [Gray86]. Recently, work has continued on
larger systems, such as the Teradata DBC/1012 [Deck86], but not much has been published
on the fault tolerance techniques used in such systems or on the evaluation of such
techniques. While it seems that many of the techniques developed for smaller systems are
applicable to larger systems, designing fault tolerance for large systems is more difficult
because they have many more components that can fail and there are more alternatives
for providing fault tolerance. [Koon86] and [Shet87] evaluate fault tolerant algorithms in
a relatively small (~10 nodes) and loosely coupled distributed system. However, there
is no published research on evaluating fault tolerance in a tightly coupled
multiprocessor system for large databases (> 100 gigabytes).

In this chapter, we study the effects of various system parameters and fault
tolerance techniques in a system for very large databases. We use a generic multiprocessor
architecture which, by choosing different system parameters, can represent a range of
system architectures, from a loosely coupled multiprocessor with non-shared memory
and partitioned database to a tightly coupled multiprocessor with shared database and
shared memory. We study the effects of various architectures, fault tolerance
techniques, mean time to failure of important components, database size and distribution,
interconnect capacity, etc. on the availability of a system. The results show that the
choice of different system architectures and some parameter values significantly affect
availability. Because fault tolerance is obtained at the cost of additional system
resources (in terms of number of redundant components and system time and resources
required to maintain redundant information), a designer's task is to minimize such cost
while obtaining the desired level of fault tolerance. The results can help a designer to
understand important trade-offs and choose an appropriate system architecture and fault
tolerance techniques.

This chapter is organized as follows. Section 7.2 describes the generic system architecture
we evaluate. Section 7.3 describes basic concepts and terminology as well as
various  fault  tolerance  techniques.  Section  7.4  defines  two  availability  measures,  one  of 
which  we  evaluate  in  detail  to  measure  fault  tolerance.  Section  7.5  describes  the  quanti¬ 
tative  analysis  and  the  results.  Section  7.6  discusses  our  conclusions. 

7.2.  System  Description 

Achieving good performance in a very large database system (VLDBS) requires a
large amount of computing power. Given the limitation of the computing power of a
single component, we use a high degree of parallelism. A VLDBS has these characteristics:

1. It has many major components such as processors, disks and memory.

2.  The  relations  are  very  large  and  distributed  over  many  components  for  storage  and 
parallel-processing.  Because  of  the  very  large  size  of  the  database,  the  amount  of 
replication  possible  is  limited. 

3. An update transaction will change many data items. This means that reintegration
of  failed  components  can  take  longer. 

We  assume  a  generic  multiprocessor  architecture  for  a  VLDBS,  as  shown  in  Figure 
7.1.  This  architecture  consists  of  a  set  of  "clusters"  linked  by  an  interconnect.  Each 
cluster  consists  of  a  set  of  processors,  a  shared  memory  bank  addressable  by  all  proces¬ 
sors  in  the  cluster,  and  a  set  of  disk  storage  units  and  associated  controllers.  Each  clus¬ 
ter  has  its  own  power  supply.  All  the  components  within  a  cluster  (processors, 
disks/controllers,  memory  and  power  supply)  are  connected  by  an  intracluster  bus.  The 
size  of  a  VLDBS  is  defined  by  the  architectural  parameters  given  in  Table  7.1.  A  com¬ 
ponent  unit  (e.g.,  a  processor  unit  or  disk  unit)  consists  of  one  or  more  components.  A 
component  unit  consisting  of  k  components  is  called  a  k-redundant  unit  in  which  k-1 
components  are  redundant  components  used  to  increase  fault  tolerance.  However,  all 
active  components  in  a  unit  perform  the  same  function,  so  the  redundancy  does  not  add 
to  the  computing  power  (in  the  case  of  the  processor  unit)  or  storage  capability  (in  the 
case  of  the  disk  unit  or  memory  unit).  Our  architecture  has  one  power  supply  unit,  one 
memory  unit,  and  an  intracluster  bus  per  cluster. 


Figure  7.1.  Architecture  of  a  VLDBS 

Parameter   Description
NC          Number of clusters
NP          Number of processor units per cluster
ND          Number of disk units per cluster

Table 7.1: Architectural Parameters

7.3.  Fault  Tolerance  Techniques 

In  this  section,  we  discuss  basic  concepts  and  principles  of  fault  tolerance  (section 
7.3.1),  hardware  fault  tolerance  techniques  (section  7.3.2),  software  fault  tolerance  tech¬ 
niques  (section  7.3.3),  and  data  storage  methods  (section  7.3.4). 

7.3.1.  Basic  Concepts 

A  component  or  a  subsystem  that  functions  according  to  its  specifications  is  called 
active.  Three  terms  are  relevant  to  the  operation  of  the  components  and  the  system: 
fault,  error,  and  failure  [Siew82,  Aviz84].  A  fault  is  a  defect  in  a  component  of  the  sys¬ 
tem.  Faults  result  in  errors,  which  are  undesired  or  invalid  component  states.  Errors 
may  result  in  failures,  which  mean  loss  of  the  service  expected  from  the  component  (or 
the  system  of  which  the  component  is  a  part).  Errors  may  be  transient,  intermittent,  or 
permanent.  The  first  two  are  also  referred  to  as  soft  errors,  while  permanent  errors  are 
referred  to  as  hard  errors.  Siewiorek  and  Swarz  [Siew82]  estimate  that  soft  errors 
account  for  more  than  90%  of  all  faults.  Failures  from  soft  errors  are  called  soft 
failures,  while  those  from  hard  errors  are  called  hard  failures.  The  fault  tolerance  tech¬ 
niques  should  protect  the  system  against  soft  as  well  as  hard  failures. 

After  the  failure  of  a  component  (or  subsystem),  the  system  may  take  a  corrective 
action  called  failure  recovery.  This  may  involve  reconfiguring  the  system  to  isolate  the 
failed  component  and  to  reorganize  the  system  so  that  it  can  be  restarted  without  the 
failed  component.  Once  the  failed  component  is  repaired,  it  is  reintegrated  with  the  sys¬ 
tem.  In  a  multiple-component  system  the  failure  of  a  single  component  that  affects  no 
other component during the recovery is called a single failure. The simultaneous failure of
more than one component, or the failure of a component while the system is recovering
from a previous component failure, is called a multiple failure (or a double failure, if two
components fail as defined). Normally, the probability of a multiple failure is very low.
After  a  failure  of  one  or  more  components,  a  system  that  can  continue  to  operate  at 
lower  efficiency  corresponding  to  the  loss  of  power  associated  with  the  failed  components 
is  called  a  gracefully  degradable  system. 

The  following  criteria  are  desirable  in  a  highly  available  database  system  [Kim84]: 

1.  The  system  must  guarantee  database  consistency  by  providing  transaction  process¬ 
ing  with  concurrency  control,  distributed  commit,  and  recovery  techniques. 

2. The system must support automatic recovery when a failure occurs. A backup process
should automatically take over when the primary process fails.

3. The system must survive at least a single failure (and possibly multiple failures) of
major system components, including the processor, disk drives, memory, and interprocess
communication medium.

4.  The  system  should  support  on-line  reintegration  of  failed  components  when  they 
are  repaired  or  replaced. 

5. Other features, such as the automatic restart of transactions affected by failures
and the rerouting of messages to bypass failed communication routes, are desired.

Two  basic  principles  are  used  to  achieve  fault  tolerance: 

1.  Modularity  —  By  modularizing  the  system,  the  modules  become  units  of  failure  and 
replacement. 

2.  Redundancy  —  By  having  a  redundant  module  (a  hardware  or  a  software  resource), 
the  primary  component  can  be  replaced  with  the  redundant  component  if  it  fails. 

The  principle  of  modularity  must  be  taken  into  account  while  designing  the  system. 
The  multiprocessor  system  evaluated  is  made  modular  by  providing  various  hardware 
fault tolerance techniques to tolerate component failures, and by designing a clustered
architecture  so  that  the  system  can  be  partially  available  using  reconfiguration  when 
some  clusters  fail. 

Redundancy  is  a  basic  property  required  for  all  fault  tolerance  techniques.  It  is 
used to provide information needed to negate the effect of failures [Siew84]. Redundancy
can be obtained by having extra components, or physical redundancy, and by
having  extra  time,  or  temporal  redundancy.  Physical  redundancy  is  used  as  hot  stand¬ 
bys  (or  backups),  as  checkers  that  mask  faults  via  voting,  and  to  reconfigure  the  system 
around  faulty  components.  Temporal  redundancy  is  used  for  retrying  operations  to 
recover  from  transient  or  soft  errors. 

The basic physical redundancy techniques used to increase the availability of each of
the components include duplication (2-redundancy), triplication (3-redundancy) and voting,
and k-redundancy. We will briefly discuss the first two techniques.

In  the  duplication  technique,  two  identical  components  (processors,  disks,  or 
memory)  are  used  in  parallel.  Duplication  can  be  used  in  two  ways.  One  technique  is 
to  use  the  second  component  as  a  hot  standby.  In  this  technique,  if  either  component 
fails,  the  other  component  takes  over.  This  technique  is  especially  useful  to  provide 
fault  tolerance  against  hard  failures.  Examples  of  this  type  of  duplication  are  disk  mir¬ 
roring  and  memory  mirroring,  which  are  discussed  in  the  next  subsection.  In  the  second 
duplication  technique,  the  two  components  are  driven  by  the  same  input,  and  their  out¬ 
puts  are  compared.  If  the  outputs  are  not  the  same  (i.e.,  they  vote  differently),  the 
operation  is  retried.  This  technique  is  especially  useful  for  providing  fault  tolerance 
against  soft  failures  and  for  detecting  potential  hard  failures.  An  example  of  this  type 
of  duplication  is  processor  pairing,  also  discussed  in  the  next  subsection. 

Triplication is usually used for the triple modular redundancy
(TMR) technique. Here three components are used in parallel. Outputs of all components
are compared, and the component unit continues to function as long as at least
two outputs match (i.e., at least two votes are received). The technique is most useful
for masking errors by voting. Thus, it is helpful for soft failures but not hard failures.
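
The voting rule for TMR can be illustrated with a minimal sketch. The function name and the use of None to signal "no majority" are our assumptions for illustration; the report does not specify a voter implementation.

```python
from collections import Counter

def tmr_vote(outputs):
    """Return the value produced by at least two of three components.

    Returns None when all three outputs differ, i.e., no majority exists
    and the error cannot be masked by voting.
    """
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three outputs")
    value, votes = Counter(outputs).most_common(1)[0]
    return value if votes >= 2 else None
```

A single disagreeing component is outvoted, which is how one erroneous output is masked.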


7.3.2.  Hardware  Fault  Tolerance 


In  this  subsection,  we  will  examine  the  fault  tolerance  techniques  for  the  major  sys¬ 
tem  components.  These  include  processors,  disks,  memory,  power  supply,  intracluster 
bus,  and  interconnect. 


7.3.2.1. Processor Pairing

The  reason  for  duplication  of  processors  is  to  check  for  soft  errors  and  increase 
fault  tolerance  by  eliminating  them.  Like  other  components,  a  processor  experiences 
two  types  of  failures,  hard  and  soft.  The  mean  time  to  failure  (MTTF)  for  hard  failures 
for microprocessors that are used in highly parallel systems is estimated to be 100K hours.
The rate of soft failures is estimated to be an order of magnitude higher; hence the soft
failure MTTF is estimated at 10K hours. With a high degree of parallelism, this failure
rate  can  have  a  disastrous  effect. 
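
To see why parallelism magnifies the problem, the following sketch estimates the time to the first failure among many processors. It assumes independent components with exponentially distributed lifetimes (so failure rates add); the processor count of 1000 is a hypothetical value chosen for illustration, not a figure from this report.

```python
def system_mttf_hours(component_mttf_hours, n_components):
    """Mean time to the first failure among n independent components.

    Assumes exponentially distributed lifetimes, so the failure rates add
    and the combined MTTF is the component MTTF divided by n.
    """
    if n_components < 1:
        raise ValueError("need at least one component")
    return component_mttf_hours / n_components

# Hypothetical system with 1000 processors and the MTTF estimates quoted above.
soft = system_mttf_hours(10_000, 1000)   # soft failures
hard = system_mttf_hours(100_000, 1000)  # hard failures
```

Under these assumptions, some processor in the system would experience a soft failure roughly every 10 hours and a hard failure roughly every 100 hours.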


In  processor  pairing,  the  same  input  and  clock  drive  two  processors,  but  the  system 
uses the output of only the primary processor. Thus, the second processor does not add
computing power to the system. However, the output of the secondary processor is
compared with that of the primary processor, and differing output signals an error.
An error triggers an instruction retry. Thus processor pairing attempts to mask
soft failures at the instruction level. It also helps to prevent error propagation. If the
error persists after several retries, it is identified as a hard error, and both processors
turn themselves off after raising an interrupt. Microprocessor-based fault-tolerant
system vendors, such as Stratus [Kast83] and Sequoia [Bern85], have taken similar
approaches. Gray [Gray86] discusses several approaches for designing processor pairs.
It  is  also  possible  to  use  the  processor  pairs  to  tolerate  hard  failures  by  using  a  different 
scheme,  but  we  will  handle  hard  processor  failures  by  using  redundancy  in  processor 
units  in  a  cluster  and  keep  the  technique  for  processor  pairing  simple. 
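
The compare-and-retry behavior of a processor pair can be sketched in software as follows. This is only an illustration of the control flow: real pairs compare outputs in hardware at the instruction level, and the names `run_paired`, `HardProcessorError`, and the retry limit are assumptions of ours.

```python
class HardProcessorError(Exception):
    """Raised when the pair still disagrees after all retries."""

def run_paired(primary, secondary, operand, max_retries=3):
    """Run one operation on both processors and retry on a mismatch.

    `primary` and `secondary` are callables standing in for the two
    processors driven by the same input. Only the primary's output is
    used; the secondary's output serves purely as a check.
    """
    for _ in range(1 + max_retries):
        out_primary = primary(operand)
        out_secondary = secondary(operand)
        if out_primary == out_secondary:
            return out_primary
        # Mismatch: treat as a soft error and retry the operation.
    # Persistent disagreement is classified as a hard error.
    raise HardProcessorError("outputs differ after %d retries" % max_retries)
```

A transient fault that disappears on retry is masked, while a persistent mismatch takes the whole pair out of service.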


7.3.2.2. Disk Mirroring

In  most  systems,  disks  are  the  biggest  reliability  problem,  because  they  are  complex 
systems  containing  a  fair  number  of  mechanical  components.  Disks  contain  circuitry  to 
detect and tolerate soft failures; so the VLDBS does not deal with soft failures within a
disk.  The  hard  failure  MTTF  of  a  disk  is  estimated  to  be  10K  hours.  Disk  mirroring, 
also  called  duplexing,  is  used  to  increase  this  MTTF. 

The technique used for mirroring disks is generically called reconfigurable duplication
[Siew82]. A disk unit in this duplication technique consists of two disks. Both
disks  receive  the  same  input  through  independent  paths  and  controllers  and  store  on 
respective  media.  Thus,  both  disks  are  in  sync.  When  both  disks  are  active,  one  acts  as 
a  primary  and  the  other  as  a  hot  standby.  If  either  of  the  disks  fails,  then  the  other 
disk  takes  over.  The  system  can  run  with  only  one  disk  operating  correctly.  When  a 
failed  disk  is  repaired,  its  contents  are  updated  so  that  both  disks  again  contain  the 
same  information.  A  repair  may  mean  replacement.  Tandem  uses  the  disk  mirroring 
technique in its fault tolerant systems.
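
The gain from mirroring can be estimated with a standard three-state Markov model of a repairable duplexed pair. This is a sketch under our own assumptions: independent exponential failures and repairs, and a mean time to repair (MTTR) of 24 hours, which is a hypothetical value not taken from this report.

```python
def mirrored_pair_mttf(mttf_hours, mttr_hours):
    """Mean time until both disks of a mirrored pair are down at once.

    Model: both disks start active; each fails at rate 1/MTTF; a failed
    disk is repaired at rate 1/MTTR while the survivor carries the load.
    Solving the Markov chain gives (3*lam + mu) / (2*lam**2), which is
    approximately MTTF**2 / (2*MTTR) when MTTR is much smaller than MTTF.
    """
    lam = 1.0 / mttf_hours   # failure rate of a single disk
    mu = 1.0 / mttr_hours    # repair rate of a failed disk
    return (3 * lam + mu) / (2 * lam ** 2)

# 10K-hour disk MTTF from the text; 24-hour MTTR is an assumed value.
pair_mttf = mirrored_pair_mttf(10_000, 24)
```

With these numbers the pair's MTTF is on the order of two million hours, about two hundred times that of a single disk.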

7.3.2.3. Techniques for Memory

The  shared  memory  in  our  architecture  is  a  random  access  memory.  As  with  other 
components,  the  memory  unit  suffers  hard  failures  as  well  as  soft  failures.  The  soft 
failures  can  be  classified  as  intermittent  and  transient  failures  and  are  much  more  fre¬ 
quent  than  hard  failures.  Intermittent  failures  occur  under  conditions  such  as  system 
overload,  while  transient  failures  are  due  to  external  conditions  such  as  voltage  fluctua¬ 
tions. 

Techniques  to  tolerate  soft  failures  include  error  detection  and  correction  techniques 
such as parity codes. In case of hard failures, the failed chip or a memory bank containing
the failed chip is replaced. A hard failure can be tolerated by a reconfiguration,
which isolates the failed chip. Duplication is another fault tolerance technique sometimes
used. The duplication can be used to detect soft failures as in processor pairs, or
it  can  be  used  for  mirroring  as  in  mirrored  disks  to  tolerate  hard  failures. 

7.3.2.4. Techniques for Power Supply

Two  components  of  power  need  to  be  considered:  power  that  comes  to  the  system 
as  a  whole,  called  power  source,  and  the  power  that  is  supplied  to  each  of  the  clusters, 
called  power  supply.  The  power  coming  from  the  source  is  usually  conditioned.  To 
improve the fault tolerance, we expect that an uninterruptible power supply (UPS) will
be  used.  A  UPS  uses  a  battery  that  immediately  takes  over  if  the  normal  power  source 
fails.  When  the  system  draws  power  from  the  battery,  it  may  be  run  at  reduced  capa¬ 
city  to  save  power.  Next,  we  expect  that  each  cluster  will  use  at  least  one  power  sup¬ 
ply.  By  having  an  independent  power  supply  for  each  cluster,  the  failure  of  a  power 
supply can directly affect only one cluster. If the cluster size is big, more than one
power  supply  per  cluster  is  recommended. 

7.3.2.5. Techniques for Intracluster Bus

This  component  connects  all  the  processors,  memory,  and  disks  within  a  cluster. 
We  expect  it  to  be  a  short  (within  one  cabinet)  and  very  high  speed  bus.  It  will  occa¬ 
sionally  suffer  from  soft  failures,  but  we  assume  that  the  low  level  protocols  and  the 
hardware  will  tolerate  them.  A  hard  failure  of  this  component  is  extremely  infrequent, 
but  if  it  occurs,  it  can  be  modeled  as  the  failure  of  one  or  more  components  that  are 
affected. 

7.3.2.6. Techniques for Interconnect

The  type  and  frequency  of  failures  in  an  interconnect  that  connects  all  the  clusters 
in the system will depend on the type of interconnect (e.g., ring, hypercube, bus) and its
topology. For the sake of simplicity, we will assume a generic interconnect. An interconnect
may suffer a soft link failure, which may corrupt or lose data. These errors can
be detected by check codes and time-outs. Most of these errors are recoverable by the
protocol (e.g., retransmission) or taken care of by error correction codes. If the soft
error is not recoverable, the transaction is aborted. Hard failures can occur because of
the  physical  breakup  of  a  link  or  failure  of  the  cluster’s  connection  to  the  interconnect. 
The  basic  technique  to  handle  hard  failures  is  duplication  in  the  interconnect  (e.g.,  dou¬ 
ble  ring).  Each  cluster  can  have  independent  paths  to  two  interconnects  to  handle  hard 
failures. 
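
The handling of soft link failures described above (detect by check code or time-out, recover by retransmission, abort the transaction otherwise) can be sketched as follows. CRC-32 stands in for an unspecified check code, and the `channel` callable, the attempt limit, and the exception name are hypothetical.

```python
import zlib

class TransactionAborted(Exception):
    """Raised when a soft link error cannot be recovered by retransmission."""

def send_with_retransmission(payload, channel, max_attempts=3):
    """Send `payload` over `channel`, retransmitting on corruption or loss.

    `channel` is a hypothetical callable that returns the bytes received,
    or None when the message is lost (modeling a time-out).
    """
    expected = zlib.crc32(payload)  # check code computed by the sender
    for _ in range(max_attempts):
        received = channel(payload)
        if received is not None and zlib.crc32(received) == expected:
            return received
        # Lost or corrupted message: fall through and retransmit.
    raise TransactionAborted("soft link error persisted after retransmission")
```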

7.3.3.  Software  Fault  Tolerance 

Basic  software  fault  tolerance  techniques  can  be  divided  into  three  components:  (1) 
transaction  fault  tolerance  for  failures  that  affect  transaction  execution,  (2)  system 
software  fault  tolerance  for  failures  that  affect  system  software  (e.g.,  operating  system 
and communication software), and (3) data fault tolerance for failures that affect data
availability. 

A  soft  failure  may  occur  in  the  system  because  of  transient  or  intermittent  failures 
in  its  hardware  components  or  software.  These  errors  may  corrupt  the  system  software 
as  well  as  fail  transactions.  For  example,  a  transient  error  in  the  shared  memory  may 
corrupt  the  system  software.  Alternately,  such  a  failure  may  corrupt  a  transaction’s 
work space. The techniques discussed in transaction fault tolerance (section 7.3.3.1) and
system software fault tolerance (section 7.3.3.2) aim to tolerate such soft failures. Section
7.3.3.3 discusses tolerating hard failures that have a more serious impact on system
functioning and require more elaborate recovery schemes such as the one discussed in
section 7.3.3.4.

7.3.3.1. Transaction Fault Tolerance

In  this  subsection  we  discuss  transaction  failure  and  recovery. 

A transaction is a set of operations performed on the database. It should have four
properties, known as the ACID properties [Haed83, Gray86]: atomicity, consistency,
isolation, and durability. Two of these properties are of main interest to us:

•  Atomicity  —  Either  all  or  none  of  the  operations  in  the  transaction  should  be  per¬ 
formed.  The  transaction  commits  if  all  the  operations  are  performed;  otherwise  it 
aborts.  This  property  should  be  guaranteed  even  in  case  of  failures  (this  property 
is  sometimes  called  failure  atomicity). 

•  Durability  —  Once  a  transaction  commits,  all  of  its  effects  must  be  preserved,  even 
if  there  are  failures. 

Transaction fault tolerance techniques help the concurrency control mechanism
guarantee  the  four  properties  even  if  there  are  failures.  In  particular,  a  two-phase  com¬ 
mit  protocol  is  used  in  a  distributed  database  environment  to  guarantee  that  either  all 
or  none  of  the  copies  of  data  are  updated  when  a  transaction  executes.  If  there  is  a 
failure  that  prevents  the  transaction  from  completing  all  its  operations,  none  is  per¬ 
formed  and  the  transaction  is  aborted. 
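
The all-or-none behavior of two-phase commit can be illustrated with a minimal in-memory sketch. It omits logging, time-outs, and coordinator failure, all of which a real protocol must handle; the class and function names are our own.

```python
class Participant:
    """A site holding one copy of the data touched by the transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # False models a failure at this site
        self.state = "active"

    def prepare(self):
        # Phase 1: vote yes only if the update can be made durable here.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Commit at every participant or at none. Returns True on commit."""
    votes = [p.prepare() for p in participants]  # phase 1: collect votes
    if all(votes):
        for p in participants:                   # phase 2: global commit
            p.commit()
        return True
    for p in participants:                       # phase 2: global abort
        p.abort()
    return False
```

A single "no" vote is enough to abort the transaction at every site, so no copy is updated unless all copies are.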

A  transaction  may  fail  because  of  a  hardware  (component)  fault  or  a  software  fault 
in  the  system.  Examples  of  the  faults  are  incorrect  execution  of  the  transaction  (e.g.,  a 
divide  by  zero);  failure  of  a  component  on  which  the  transaction  was  executing  (e.g.,  a 
processor failure); and a transient error in the transaction’s work space in memory. The
approach  to  handling  these  failures  is  simple.  In  two-phase  commit,  a  temporary  copy  of 
data  resulting  from  updates  is  stored  on  the  disk  (called  safe  storage).  If  a  failure  affects 
the  transaction,  the  transaction  is  aborted.  If  the  failure  has  resulted  from  a  soft  fault 
such  as  transient  error  in  the  transaction  work  space,  the  transaction  is  restarted.  If  the 
failure has resulted from a hard fault but is recovered (e.g., by reconfiguration as discussed
in section 7.3.3.4), the transaction is restarted. For example, if the processor
executing  the  transaction  fails,  the  transaction  is  aborted  and  restarted  by  assigning  it 
to  another  processor  in  the  same  cluster.  If  the  disk  in  a  cluster  fails,  the  cluster  fails. 
In  this  case,  all  transactions  in  the  failed  cluster  are  aborted.  After  the  recovery,  which 
is  discussed  in  the  next  subsection,  the  transactions  may  be  restarted. 

When  a  failure  affecting  a  transaction  occurs,  it  is  important  to  efficiently  abort  the 
transaction  and  bring  the  database  and  the  system  to  a  consistent  state.  There  are  two 
basic  models  of  transaction  execution  that  lead  to  different  techniques  for  transaction 
abort.  One  model  is  called  an  UNDO  model,  in  which  the  transaction  writes  directly  in 
the  database  while  executing  ("write  in  place").  When  the  transaction  is  aborted,  the 
transaction  mechanism  performs  UNDO  operations  for  all  the  operations  performed  by 
the  aborted  transaction.  In  the  other  model,  called  the  work  space  model,  a  transaction 
writes  in  a  work  space  (also  called  differential  files)  until  it  is  committed.  Upon  commit¬ 
ment,  it  writes  the  changes  stored  in  the  temporary  database  into  the  permanent  data¬ 
base.  If  a  transaction  is  aborted,  the  work  space  is  simply  discarded.  We  prefer  the 
latter  model. 
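
A minimal sketch of the work space model follows, using a Python dictionary as a stand-in for the permanent database. The class name and interface are assumptions for illustration; a real system would also make the commit step itself atomic and durable.

```python
class WorkspaceTransaction:
    """Buffers writes in a private work space until commit."""

    def __init__(self, database):
        self.database = database   # permanent database (shared dict)
        self.workspace = {}        # private changes of this transaction

    def write(self, key, value):
        self.workspace[key] = value  # the permanent database is untouched

    def read(self, key):
        # A transaction sees its own uncommitted writes first.
        if key in self.workspace:
            return self.workspace[key]
        return self.database.get(key)

    def commit(self):
        self.database.update(self.workspace)  # install all changes
        self.workspace.clear()

    def abort(self):
        self.workspace.clear()  # nothing to undo in the database
```

Abort costs only the discarding of the work space, which is why this model avoids the UNDO operations required by write-in-place.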

To  aid  in  transaction  restart  and  system  reconfiguration,  at  least  two  copies  of 
transaction information are maintained in the system (see section 7.3.3.2).

7.3.3.2. System Software Fault Tolerance

In  this  subsection,  we  discuss  how  to  tolerate  failures  that  affect  the  system 
software.  We  will  only  discuss  the  soft  failures  that  corrupt  the  system  software 
because  the  hard  failures  are  handled  as  in  the  data  fault  tolerance. 

We  considered  transaction  failures  in  the  previous  subsection.  We  use  the  tradi¬ 
tional  way  of  reinitialization  (or  reboot)  to  recover  from  the  failures  that  corrupt  the 
system  software.  The  reinitialization  may  be  limited  to  the  cluster  in  which  corruption 
occurs.  Many  systems  today  have  multiple  levels  of  system  reboot  procedures,  often 
referring  to  a  more  extensive  reboot  as  a  cold  reboot  and  to  a  less  extensive  or  limited 
reboot  as  a  warm  reboot.  Proper  care  in  developing  system  software,  especially  recovery 
techniques,  will  limit  software  reinitialization  to  a  warm  reboot. 

7.3.3.3. Data Fault Tolerance

Availability  of  the  database  greatly  depends  on  how  the  data  is  distributed  and 
duplicated.  The  data  fault  tolerance  is  attained  by  proper  placement  of  data.  Our  pri¬ 
mary  method  is  to  have  two  copies  of  each  data  item.  To  do  this,  we  first  horizon¬ 
tally  fragment  a  relation  and  assign  different  horizontal  fragments  to  different  clusters. 
A  fragment  assigned  to  a  cluster  is  further  vertically  fragmented,  and  different  vertical 
fragments are assigned to different disk units in a cluster. In this scheme, multiple disk
units  within  a  cluster  are  used  to  improve  response  time  but  not  fault  tolerance.  Later 
we  briefly  discuss  a  data  quadruplication  scheme  in  which  multiple  units  within  a  clus¬ 
ter  are  also  used  to  improve  fault  tolerance. 

Maintaining  two  copies  and  multiple  fragments  of  each  copy  can  also  simplify  and 
improve  the  retrieval  performance  of  the  system,  but  we  will  not  address  that  issue 
here.  We  use  three  rules  in  our  primary  scheme  to  distribute  the  database: 

1.  The  database  is  partitioned  across  the  clusters.  Thus  the  failure  of  a  cluster  will 
make  only  the  fragment(s)  assigned  to  it  unavailable. 

2.  Each  partition  assigned  to  a  cluster  is  uniformly  distributed  over  all  the  disk  units 
in  that  cluster.  It  may  be  noted  that  the  purpose  of  having  multiple  disks  in  a 
cluster  is  to  increase  the  parallelism  and  not  the  fault  tolerance. 

3.  The  two  copies  of  data  (fragment  in  our  scheme)  cannot  reside  in  the  same  cluster 
for  the  sake  of  fault  tolerance. 

Fragment  Allocation  Scheme 

Now  let  us  discuss  how  fragments  are  allocated  to  different  clusters.  It  is  impor¬ 
tant  to  allocate  the  fragments  of  each  of  the  two  copies  to  the  clusters  in  an  intelligent 
way  because  it  affects  the  fault  tolerance.  For  example,  consider  dividing  the  database 
into four fragments so that two copies of each fragment are distributed over four clusters.
Two of the possible data distribution schemes are shown in Figures 7.2 and 7.3.
In both schemes, the database will remain available after a single cluster failure
because one copy of every fragment will still be accessible. In the first scheme
the  database  will  be  available  in  two  out  of  a  possible  six  double  failures  of  clusters.  In 
the  second  scheme  the  database  will  be  available  in  four  out  of  a  possible  six  double 
failures  of  clusters.  Thus,  the  data  placement  in  the  second  case  is  preferable.  Because 
the  probability  of  double  failures  is  negligible,  we  will  not  address  recovery  of  double  (or 
multiple)  failures  of  clusters. 
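
The double-failure counts quoted above can be checked by enumeration. Because the figures are not reproduced here, the two placements below are assumptions: they were chosen to be consistent with the counts and the cluster 1 example in the text, and may differ in detail from Figures 7.2 and 7.3.

```python
from itertools import combinations

# Hypothetical placements (cluster -> fragments held), assumed for illustration.
SCHEME_1 = {1: {"F1", "F4"}, 2: {"F1", "F2"}, 3: {"F2", "F3"}, 4: {"F3", "F4"}}
SCHEME_2 = {1: {"F1", "F2"}, 2: {"F1", "F2"}, 3: {"F3", "F4"}, 4: {"F3", "F4"}}

def survivable_double_failures(placement):
    """List the cluster pairs whose joint failure leaves every fragment available."""
    all_fragments = set().union(*placement.values())
    survivable = []
    for failed in combinations(sorted(placement), 2):
        remaining = set()
        for cluster, fragments in placement.items():
            if cluster not in failed:
                remaining |= fragments
        if remaining == all_fragments:
            survivable.append(failed)
    return survivable
```

With these assumed layouts, the first placement survives two of the six possible double failures and the second survives four, matching the counts given in the text.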

Now  let  us  look  at  how  the  system  can  be  reconfigured  in  terms  of  data  placement 
after  one  or  more  single  failures  of  clusters.  Consider  the  failure  of  cluster  1  when  the 
scheme shown in Figure 7.2 is used. In this case, one copy each of fragment F1 and
fragment F4 will be lost, and the other three clusters have only one copy of fragments
F1 and F4. Although it is possible to continue processing the transactions, it is not
desirable to do so, because it will not allow the system to be gracefully degradable.
Failure of one more cluster may leave the system with no copy of a fragment. For
example, failure of cluster 2 would mean that no copy of fragment F1 will be available,
and  so  the  system  can  no  longer  process  transactions. 

The  solution  is  to  create  a  second  copy  of  the  fragments  that  are  unavailable 
because  of  a  cluster  failure.  Following  the  data  placement  rules  discussed  above,  the 
system of three working clusters can be reconfigured as shown in Figure 7.4. Following
the same arguments, if cluster 3 were to fail after the reconfiguration that followed the
failure of cluster 1 (as shown in Figure 7.4) but before cluster 1 is reintegrated, the
system can be reconfigured as shown in Figure 7.5. Because at least two copies of data
must be kept in the working system and the two copies cannot be stored on the same
cluster, the system can tolerate faults until two clusters are left in the working system.

Figure 7.2. Data Placement Scheme 1

Figure 7.3. Data Placement Scheme 2
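
The reconfiguration step can be sketched as follows. The initial placement and the least-loaded target selection are our assumptions; the report does not state how the target cluster is chosen, so the intermediate layout may differ from Figure 7.4.

```python
def reconfigure(placement, failed):
    """Drop `failed` and restore two copies of every fragment it held.

    `placement` maps cluster -> set of fragments. Each lost copy is
    re-created on a surviving cluster that does not already hold that
    fragment, so the two copies never share a cluster.
    """
    survivors = {c: set(f) for c, f in placement.items() if c != failed}
    if len(survivors) < 2:
        raise RuntimeError("fewer than two clusters left; cannot keep two copies")
    for fragment in sorted(placement[failed]):
        candidates = [c for c in survivors if fragment not in survivors[c]]
        # Assumed policy: pick the least-loaded candidate (ties by cluster id).
        target = min(candidates, key=lambda c: (len(survivors[c]), c))
        survivors[target].add(fragment)
    return survivors

# Hypothetical initial placement consistent with the cluster 1 example above.
initial = {1: {"F1", "F4"}, 2: {"F1", "F2"}, 3: {"F2", "F3"}, 4: {"F3", "F4"}}
after_1 = reconfigure(initial, failed=1)
after_3 = reconfigure(after_1, failed=3)
```

After both failures, clusters 2 and 4 each hold all four fragments, and a further failure can no longer be absorbed while keeping two copies on distinct clusters.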

After a failed cluster is repaired, it is reintegrated into the system. This involves
replacing (or updating) the copies of the fragments on the repaired cluster with the up-to-date
copy in the rest of the working system. For example, if cluster 1 was repaired
after the situation shown in Figure 7.5, fragments F1 and F4 will be updated using
either of the two up-to-date copies on clusters 2 and 4, fragment F1 will be deleted from
cluster 4, and fragment F4 will be deleted from cluster 2. The process of replacing (or
updating) can occur in the background for the active clusters (clusters 2 and 4 in this
case) for most of the time. However, just before cluster 1 is made active, there may be
a brief pause in the system to allow for reallocating transactions and for updating
the global data allocation directories. We leave out some details of cluster reintegration.

Figure 7.4. Data Placement After Cluster 1 Fails

Figure 7.5. Data Placement After Cluster 3 Also Fails

The  above  scheme  is  based  on  data  duplication.  If  the  fault  tolerance  achieved  by 
maintaining  two  copies  does  not  meet  the  requirement,  the  amount  of  data  replication 
can be increased. However, this can be done only at the cost of performance, since
each additional copy of a data item adds storage and update synchronization overhead.
We  will  briefly  discuss  a  data  quadruplication  strategy,  which  may  improve  the  fault 
tolerance substantially.

The  quadruplication  strategy  maintains  two  copies  of  data  within  one  cluster  and 
two  additional  copies  in  another.  The  two  copies  within  the  clusters  can  be  distributed 
in  two  ways,  giving  two  different  quadruplication  schemes.  In  the  first  scheme,  a  disk 
unit  consists  of  mirrored  disks.  In  this  case,  logically,  there  are  only  two  copies  of  every 
fragment,  but  physically  there  are  four  copies  of  fragments.  The  hardware  keeps  two 
identical copies on every mirrored disk. Since system software uses logical addressing, the
data  update  algorithm  knows  about  only  two  copies.  In  the  second  scheme,  the  same 
strategy  of  fragmentation  used  across  the  clusters  is  also  used  now  for  the  two  copies 
across  the  disk  units  within  a  cluster.  In  this  case,  the  data  update  algorithm  knows 
about  the  four  copies  and  has  to  send  update  commands  to  all  four  copies  separately. 
Thus, in this scheme the cluster does not fail if a disk unit goes down in a cluster (just as
the  system  will  not  go  down  if  one  cluster  goes  down).  We  feel  that  the  first  scheme  is 
simpler  and  has  less  overhead  when  there  are  no  failures.  Overheads  for  the  two 
schemes  when  failures  occur  remain  to  be  studied. 

7.3.3.4. Recovery From Hard Failures

A  hard  failure  is  usually  more  serious,  requiring  more  expensive  methods  to  tolerate 
it.  There  are  two  types  of  hard  failures  with  respect  to  the  type  of  recovery  required  — 
one  in  which  the  availability  of  the  data  is  not  affected  and  the  other  in  which  the  data 
becomes unavailable. The latter type is like a media failure in traditional systems and
requires a more elaborate recovery process. An example of the first type of failure is a
hard failure of a processor unit (a single processor or a processor pair). An example of
the second type of failure is the failure of the shared memory or the failure of a disk unit
(a disk or mirrored disks), both of which lead to a cluster failure in our primary scheme
of data distribution.

A component suffering a hard failure remains out of commission for a relatively
long time, and we cannot let the system be unavailable for that time. The process we
use to recover from hard failures is called reconfiguration. In this process, the following
actions are taken. First, the failed component is logically isolated from the system so that
it can be either repaired or replaced. Second, in some cases, spare replacement
components can readily replace the failed components. If a replacement is made, the system
can function at the original capacity once the reconfiguration is complete. If a replacement
is not made, the system will function at a lower capacity, reduced in proportion to the
capacity lost with the failed component. Third, the functions of the failed component are
redistributed to the functioning components. If a part of the data becomes unavailable,
then that data is made available again by making a new copy of it.

Next  let  us  look  at  recovery  when  a  cluster  fails  and  data  availability  is  also 
affected.  Let  us  first  discuss  the  important  design  features  that  help  in  the  recovery. 

Adjacency: There are many ways in which a cluster failure can be detected. For the
sake of brevity, we will discuss only one possible scheme. We arrange all the clusters in
a logical unidirectional ring. Periodically, each cluster sends an "I am alive" message to
the cluster to its left. If a cluster does not receive this message within a predetermined
time, it communicates with the cluster to its right to ascertain that that cluster is dead.

Duplicate Transaction Information: It is necessary to keep at least two copies of
information on the active transactions in the system. To store this information, each
cluster maintains a transaction table. Information about each active transaction is
maintained in two transaction tables: one in the cluster where the transaction
originated, and the other in the cluster to its left.

Global Data Allocation Directory: A global data allocation directory is a bidirectional
list relating data fragments and clusters, so that it can be searched either by data
fragment or by cluster.

Duplicate Lock Information: Each cluster maintains a lock table. When a transaction
accesses a data item, a lock is set on both copies of the data item, which are in
different clusters. The respective lock tables maintain this information.
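
The adjacency feature can be illustrated with a small simulation. The sketch below is only an illustration, not the report's implementation: the class name `Ring`, the method names, and the `TIMEOUT` value are assumptions introduced here. It models the logical ring, the periodic "I am alive" message sent to the left neighbor, and the timeout check. For simplicity the ring is not re-linked around a failed cluster.

```python
TIMEOUT = 3.0  # assumed: time units a cluster waits before suspecting a neighbor


class Ring:
    """Hypothetical model of clusters 0..n-1 arranged in a logical ring."""

    def __init__(self, n):
        self.n = n
        self.alive = set(range(n))
        # last_heard[c] is when cluster c last received "I am alive"
        # from the cluster on its right.
        self.last_heard = {c: 0.0 for c in range(n)}

    def left_of(self, c):
        return (c - 1) % self.n

    def right_of(self, c):
        return (c + 1) % self.n

    def send_heartbeats(self, now):
        """Each live cluster sends "I am alive" to the cluster on its left."""
        for c in self.alive:
            self.last_heard[self.left_of(c)] = now

    def suspects(self, now):
        """Return (detector, suspect) pairs for right neighbors silent too long."""
        return [(c, self.right_of(c)) for c in sorted(self.alive)
                if now - self.last_heard[c] > TIMEOUT]


ring = Ring(4)
ring.send_heartbeats(1.0)
ring.alive.discard(2)          # cluster 2 suffers a hard failure
ring.send_heartbeats(2.0)
ring.send_heartbeats(6.0)
print(ring.suspects(6.0))      # cluster 1 has not heard from cluster 2
```

In this run only cluster 1 reports a suspect, namely cluster 2, because every other live cluster keeps hearing from its right neighbor.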


Component         Failure          Handling Method
----------------  ---------------  ----------------------------------------------
Processor         Soft Failure     Processor Pairing
                  Hard Failure     Transaction Recovery, or
                                   Software Reinitialization, or
                                   System Reconfiguration (in extreme cases)
Disk              Soft Failure     Error Correction Codes and Hardware Techniques
                  Hard Failure     Disk Mirroring, or System Reconfiguration
Memory            Soft Failure     Parity and Error Correction Codes, or
                                   Transaction Recovery, or
                                   Software Reinitialization
                  Partial Failure  Memory Reorganization
                  Hard Failure     Memory Mirroring, or System Reconfiguration
Power Supply      Soft Failure     Hardware Techniques
                  Hard Failure     System Reconfiguration
Intracluster Bus  Soft Failure     Hardware Techniques
                  Hard Failure     Treated as failure of associated component
Interconnect      Soft Failure     Protocol/Retries, and Error Correction Codes
                  Hard Failure     Redundant Paths

Table 7.2: Failure Tolerance Techniques As Applied To Various Failures

In  certain  failures,  it  is  necessary  to  isolate  the  failed  component  and  continue  the 
system  operation.  When  a  component  is  isolated,  access  to  some  data  may  be  lost  until 
the  failed  component  is  repaired.  Our  data  storage  scheme  dictates  that  at  least  two 
copies  of  any  data  must  be  available  to  allow  tolerating  any  additional  failures  and 
graceful system degradation. The reconfiguration scheme depends on system hardware
and software configuration, including the data storage scheme. We will now discuss the
system  recovery  scheme  used  to  tolerate  some  of  the  hard  failures  that  result  in  cluster 
failure.  We  will  assume  the  data  duplication  scheme  defined  earlier. 

1.  Detect  Cluster  Failure.  Some  of  the  hard  failures  result  in  the  cluster  failure. 
A  failure  of  a  cluster  is  detected  using  the  adjacency  feature  described  above.  For 
simplicity,  we  will  discuss  a  centralized  reconfiguration  algorithm  in  which  the 
adjacent  cluster  that  detects  the  cluster  failure  coordinates  the  reconfiguration. 

2. Prepare for reconfiguration. Upon detection of a cluster failure, the system
enters a pause state. During this step, all the committed transactions are completed.
All transactions that are active but not yet committed are aborted.
This step requires accessing and updating the appropriate transaction tables and lock
tables. The aborted transactions are restarted when the system leaves the pause
state. The system remains in the pause state until the reconfiguration is completed.
While the system is in the pause state, newly arriving transactions are queued
behind the transactions to be restarted.

3. Find what is missing and locate the redundant copy. The coordinator
accesses its global data allocation directory to find the data fragments that
were stored on the failed cluster and to locate the respective redundant
fragments. We call the clusters holding the redundant data source clusters.

4.  Decide  the  Destination  Clusters.  These  are  the  clusters  where  the  second  copy 
of  the  data  has  to  be  created  from  the  copy  of  data  on  the  source  clusters. 

5. Send the Data to the Destination Clusters. The coordinator instructs the
source clusters to send the relevant data to the destination clusters. A destination
cluster decides how to store the data on its disks.

6. Update the Global Data Directories. The coordinator sends appropriate
information to all the active clusters to update their respective global directories.

7. Restart. The coordinator signals "all O.K." to all the clusters. The system then
leaves the pause state and starts processing the transactions in its queues.
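
Steps 3 through 6 can be sketched in a few lines. This is an illustration under stated assumptions, not the report's algorithm: the report does not say how destination clusters are chosen, so the sketch assumes a least-loaded placement rule, and the function name `reconfigure` and the dictionary layout of the directory are hypothetical.

```python
def reconfigure(directory, failed, active):
    """Plan recovery after cluster `failed` is lost.

    directory: dict mapping fragment name -> set of clusters holding a copy
    active:    set of clusters still running (does not include `failed`)
    Returns a list of (fragment, source, destination) transfers and updates
    `directory` in place.
    """
    # Fragments currently stored per active cluster (for the assumed placement rule).
    load = {c: 0 for c in active}
    for holders in directory.values():
        for c in holders:
            if c in load:
                load[c] += 1

    transfers = []
    for fragment in sorted(directory):
        holders = directory[fragment]
        if failed not in holders:
            continue                          # step 3: nothing lost for this fragment
        survivors = holders - {failed}
        if not survivors:
            raise RuntimeError(f"both copies of {fragment} are lost")
        source = min(survivors)               # step 3: the source cluster
        # Step 4 (assumption): choose the least-loaded active cluster that
        # does not already hold a copy of this fragment.
        candidates = [c for c in active if c not in survivors]
        if not candidates:
            raise RuntimeError(f"no destination available for {fragment}")
        destination = min(candidates, key=lambda c: (load[c], c))
        transfers.append((fragment, source, destination))   # step 5
        directory[fragment] = survivors | {destination}     # step 6
        load[destination] += 1
    return transfers


directory = {"f1": {0, 1}, "f2": {1, 2}, "f3": {2, 0}}
plan = reconfigure(directory, failed=1, active={0, 2, 3})
print(plan)
```

After the call every fragment again has two copies, none of them on the failed cluster, which is the condition the data storage scheme requires before the system leaves the pause state.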

After the repair of one or more failed components, a cluster may be ready to be
reintegrated with the rest of the active system. The reintegration process requires that
the data on the repaired cluster be brought up to date. Most of this process can be
performed in the background. For the sake of brevity, we do not provide further
details.

7.4.  Measuring  Fault  Tolerance 

The  parameter  used  most  widely  to  characterize  the  fault  tolerance  of  a  system  is 
its  availability.  To  measure  the  availability  of  a  system  and  to  devise  fault  tolerance 
techniques  to  achieve  the  desired  level  of  system  fault  tolerance,  we  should  also  be  able 
to measure availability of each of the system components.

7.4.1. Component Availability

The  term  fault  tolerance  reflects  the  ability  of  a  system  or  component  to  tolerate  a 
fault  and  remain  active.  As  discussed  earlier,  availability  of  a  component  or  a  system 
reflects  its  fault  tolerance.  Two  parameters  of  a  component  are  relevant  in  calculating  a 
component’s  availability:  MTTF,  mean  time  to  failure,  and  MTTR,  mean  time  to 
repair.  MTTF  refers  to  the  average  time  that  elapses  between  two  consecutive  failures 
of  a  component.  MTTR  refers  to  the  average  time  needed  to  detect  and  repair  a  failed 
component. The availability of a component can simply be given by the ratio,

$$ \text{Availability} = \frac{MTTF}{MTTF + MTTR} \tag{7.1} $$
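
As a quick numerical illustration of the ratio above, the following sketch evaluates it for a single disk. The function name `availability` is introduced here for illustration; the 10K-hour MTTF is the disk estimate from Table 7.3, and the 24-hour MTTR is the repair time used in the examples of section 7.5.1.

```python
def availability(mttf_hours, mttr_hours):
    """Fraction of time a component is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


single_disk = availability(10_000, 24)
print(f"{single_disk:.6f}")   # a little under 0.998
```

A single disk with these values is therefore unavailable for roughly a quarter of a percent of the time, which motivates the redundancy techniques evaluated later.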

Estimates  of  the  MTTF  and  MTTR  of  each  component  of  a  system  should  be 
known  before  system  availability  can  be  estimated.  Table  7.3  gives  the  estimates  of 
MTTF  for  the  components  that  we  will  use  in  our  analysis.  These  estimates  have  been 
derived  from  the  literature. 


Component            MTTF Estimate
-------------------  -------------
Processor            100K hours
Memory (64 MB RAM)   100K hours
Disk                 10K hours
Communication        100K hours
Power Supply         100K hours

Table 7.3: Component MTTF Estimates


7.4.2.  System  Availability 

Availability of a fault tolerant database system can be defined in one of two ways.
We call them the strong availability criterion [Smit86] and the weak availability criterion
[Shet87].

Strong Availability: Strong availability of a database system is given by the percent
of the time the entire database is available for access by authorized users, i.e.,

$$ S\_Avail = \frac{\text{time the entire database is available}}{\text{total time}} \times 100 \tag{7.2} $$


Weak Availability: Weak availability of a database system is given by the percent of
the time a transaction can be processed without being aborted because of a failure in
the system. The system is considered available as long as the transactions submitted to
it can be processed without a delay or with a reasonable delay. An example of a
reasonable delay is the recovery time for one soft failure during a transaction execution. Let:

P(nf) = Probability that no failure occurs while a transaction executes, and

P(sf) = Probability that there are no multiple failures during the transaction
execution and that the system recovers from every single failure. In other words, this
is the probability that the system goes into a pause state at least once during a
transaction execution.

Then, 

W_Avail  =  P(nf)  +  P(sf)  (7.3) 

There may be two reasons for using S_Avail as the availability criterion [Kim84, Smit86].
First, it is easier to quantify S_Avail than W_Avail. Second, it is easier to
justify that the specified fault tolerance techniques meet certain availability goals.

7.4.3. Assumptions

Our  analysis  is  not  a  detailed  analysis  of  a  particular  system.  We  make  several 
simplifying  assumptions  without  compromising  the  basic  nature  of  the  problem.  Some  of 
the  important  assumptions  are  as  follows: 

1. Each system component or module is fail fast, i.e., it either functions properly or
stops [Schl83].

2. Each component fails and recovers independently.

3. The time to failure of a component is exponentially distributed with mean
MTTF. Similarly, the time to repair of a component is exponentially distributed
with mean MTTR.

4. Components (particularly the interconnect) are lightly loaded during recovery. To
include the queuing effect in calculating MTTR_cluster, a more detailed model (e.g.,
[Shet85]) needs to be used.


7.5.  Quantitative  Analysis 

In this section, we will study the availability of a VLDBS. In section 7.5.1, we
study the availability of subsystems that consist of multiple nonredundant and redundant
components. A VLDBS is composed of such subsystems. In sections 7.5.2
and 7.5.3, we study the mean time to failure of a cluster and time to recover from a
cluster failure, respectively. These two parameters are used to calculate system
availability of a VLDBS in section 7.5.4. Each of these sections discusses basic quantitative
methods to calculate output parameters using input parameters, followed by examples
and evaluations using a range of input parameter values.

7.5.1.  Component  Availability 

In this subsection, we calculate the availability of subsystems consisting of multiple
nonredundant components (section 7.5.1.1) and multiple redundant components
(section 7.5.1.2).

7.5.1.1. Availability of Subsystems Consisting of Nonredundant Components

A system with no redundant components requires that all the components are
active for the system to be active. Such a system is also called a series system. Let the
mean time to failure of the i-th component be MTTF_i. Then the mean time to failure
of a system of n components, MTTF_ser-sys, is given by:

$$ MTTF_{ser\text{-}sys} = \frac{1}{\sum_{i=1}^{n} \dfrac{1}{MTTF_i}} \tag{7.4} $$


If a subsystem has k identical components and if the mean time to failure of each
component is MTTF, then the mean time to failure of the subsystem, MTTF_k-ser-sys, is
given by reducing equation 7.4, i.e.,

$$ MTTF_{k\text{-}ser\text{-}sys} = \frac{MTTF}{k} \tag{7.5} $$
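
The two series-system formulas can be checked with a short sketch. The function names below are introduced here for illustration only; the 100K-hour and 10K-hour figures are the processor and disk estimates from Table 7.3.

```python
def series_mttf(mttfs):
    """MTTF of a system that needs every component active (failure rates add)."""
    return 1.0 / sum(1.0 / m for m in mttfs)


def identical_series_mttf(mttf, k):
    """Special case of k identical components: MTTF / k."""
    return mttf / k


# Two 100K-hour processors that must both work (a processor pair).
print(series_mttf([100_000, 100_000]))      # about 50,000 hours
# One 10K-hour disk in series with one 100K-hour processor.
print(series_mttf([10_000, 100_000]))       # about 9,091 hours
```

The second result shows how the least reliable component dominates a series system: adding a processor ten times more reliable than the disk still lowers the combined MTTF below that of the disk alone.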


7.5.1.2. Availability of Subsystems Consisting of Redundant Components

In section 7.3.2 we discussed using redundancy to increase fault tolerance of a
component. Consider a subsystem that has reconfigurable duplication as in disk mirroring.
Such a subsystem consists of a pair of identical components working in parallel and
performing the same task. The subsystem can perform the required task as long as at least
one of the two components is active. Using a combinatorial model of system availability,
we can map this subsystem to a two module, two repairman model (or alternatively
a Markovian queue called an M/M/2/2/2 queue) as follows. We note that the system can
be in one of three states (see Figure 7.6). In state 0, both components are active.
In state 1, one of the components has failed, but the other is active. In state 2, both
components have failed. An arc joining two states specifies the rate at which the
subsystem changes from the state at the tail of the arc to the state at the head of the arc.
The two state transition rates shown over the arcs are defined as follows (assume the mean
time to failure and mean time to repair of each component to be MTTF and MTTR,
respectively).

Figure 7.6. Model of Reconfigurable Duplication

failure rate = λ = 1 / MTTF

repair rate = μ = 1 / MTTR

Now the availability of this subsystem, A_pair, can be calculated from the probability that
the system will be either in state 0 or state 1 and can be given as follows ([Siew82], page
280):

$$ A_{pair} = \frac{2\lambda\mu + \mu^2}{\lambda^2 + 2\lambda\mu + \mu^2} \tag{7.6} $$


The repair time for this subsystem, MTTR_pair, is given by MTTR/2, because the
subsystem is modeled as failed when it is in state 2, and it is repaired when it returns to
state 1. This happens at the rate 2μ. The mean time to failure of the subsystem,
MTTF_pair, is then obtained by solving equation 7.1 for MTTF, using A_pair and MTTR_pair.


We can extend the above two-component subsystem with one redundant component
to a k-component subsystem with k-1 redundant components. This subsystem
fails only when none of its components is active. It can be modeled by a k module, k
repairmen model (or an M/M/k/k/k queue). The availability of this system, A_k-redundant,
can be given by extending equation 7.6 as follows:

$$ A_{k\text{-}redundant} = \frac{(\lambda + \mu)^k - \lambda^k}{(\lambda + \mu)^k} \tag{7.7} $$

Practically, this availability quickly approaches 1 for the values of the mean time
to failure given in Table 7.3. We can show that the mean time to repair for this
subsystem is inversely proportional to k. The corresponding value of the mean time to
failure of the subsystem will be very large compared with the mean time to failure of an
individual component.
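
The redundant-subsystem model can be evaluated numerically as follows. This is a sketch under stated assumptions: the function names are introduced here, a year is taken as 8760 hours, and the subsystem MTTF is obtained by solving equation 7.1 for MTTF with a subsystem repair time of MTTR/k. The unavailability is computed directly, rather than as one minus the availability, to avoid losing precision when the availability is very close to 1.

```python
HOURS_PER_YEAR = 8760  # assumption: 365-day year


def redundant_unavailability(mttf, mttr, k=2):
    """Probability that all k components are down: (lambda / (lambda + mu)) ** k."""
    lam, mu = 1.0 / mttf, 1.0 / mttr
    return (lam / (lam + mu)) ** k


def redundant_mttf(mttf, mttr, k=2):
    """Subsystem MTTF from A = MTTF / (MTTF + MTTR) with repair time MTTR / k."""
    u = redundant_unavailability(mttf, mttr, k)
    return (mttr / k) * (1.0 - u) / u


# Mirrored disks: component MTTF 10K hours, MTTR 24 hours.
print(1.0 - redundant_unavailability(10_000, 24))
print(redundant_mttf(10_000, 24) / HOURS_PER_YEAR)     # roughly 239 years
# Mirrored memory: component MTTF 100K hours, MTTR 24 hours.
print(redundant_mttf(100_000, 24) / HOURS_PER_YEAR)    # roughly 23,794 years
```

With k = 1 the same function returns the component's own MTTF, so the single-component case is covered without a special branch.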


7.5.1.2.1. Availability of Hardware Components with Redundancy

Now let us look at the effect of some of the hardware fault tolerance techniques
discussed in section 7.3.2. Table 7.4 gives the parameters of interest.


MTTF_disk    Mean time to failure of a single disk
MTTF_proc    Mean time to failure of a single processor
MTTF_mem     Mean time to failure of a single memory
MTTF_int     Mean time to failure of a single interconnect
MTTF_ps      Mean time to failure of a single power supply

Table 7.4: Component Fault Tolerance Parameters


Effect of Disk Mirroring: Disk mirroring is used to tolerate hard failures of disk
units. By disk mirroring, the availability of the disks can be greatly improved.
However, this fault tolerance is achieved at the cost of a 100% overhead in the number of
components.

Example: Consider a subsystem consisting of mirrored disks with MTTF_disk of
10K hours and a mean time to repair, MTTR, of 24 hours. By using equation 7.6,
the availability of a disk unit of mirrored disks evaluates to 0.9999942675 and its
mean time to failure, MTTF_pair, to 239 years.

Example: A duplication scheme can also be used for memory. For MTTF_mem of 100K
hours, the mean time to failure of a memory unit consisting of mirrored memory will be
23,793 years! The effect of using a dual interconnect and a dual power supply can be
calculated similarly.

Effect of Processor Pairing: The use of redundancy in processor pairing as discussed
in section 7.3.2 is quite different from that in disk mirroring. This is because disk
mirroring is used to tolerate hard faults, while processor pairing is used to tolerate soft
faults. By processor pairing, all the soft faults are detected. Since many of the errors
are transient, many of the faults do not recur. For recurring soft faults, the transaction
is aborted or the system code is reinitialized. If the soft error persists after several
retries, it is manifested as a hard failure. The mean time to failure corresponding to
the soft faults is estimated to be 10K hours, i.e., soft faults are about ten times as
frequent as hard failures. Processor pairing tolerates these soft faults. For any hard
failure, an interrupt is generated and the processor pair detaches itself from the system.
Thus, unlike disk mirroring, the processor pair does not continue to work if one of the two
processors in a pair fails.

Example: For a processor pair to be active, both the processors should be active.
Thus if the mean time to failure of each processor, MTTF_processor, is 100K hours, then by
using equation 7.5, the mean time to failure of a processor unit, MTTF_pu, consisting of a
processor pair will be 50K hours.

Effect of Processor Redundancy: Processor redundancy is used to tolerate hard
failures that disable processor units (which may be a single processor or a processor
pair). This is done by using multiple processor units in a cluster. Since at least one
processor unit should be active for the cluster to be active, processor redundancy
decreases the probability of a cluster failure due to processor failures. Processor
redundancy also contributes to an increase in the processing power.

Example: If a cluster contains two processor units, such that each unit is a
processor pair with the mean time to failure of each unit, MTTF_pu, of 50K hours,
then the mean time to failure of the cluster due to processor failures, MTTF_procs, will be
5950 years. If a cluster contains three processor units, MTTF_procs will be 8,269,650
years!

7.5.2.  Parameters 

Many parameters affect the availability of a system. Table 7.4 defines component
parameters. Table 7.5 defines system parameters. The last three parameters in Table
7.5 are called output parameters; the rest are called input parameters. Ranges or
multiple values of input component and system parameters are used to study the effect
of parameter values on system availability. Table 7.6 gives the range values and default
values used in our quantitative evaluations. If the value of a parameter used during a
calculation is not explicitly mentioned, then the default value should be assumed.


NC             Number of clusters in the system
ND             Number of disk units per cluster
NP             Number of processor units per cluster
DBS            Database size
DF             Distribution factor
C              Interconnect capacity
PS             Page size
I              I/O access time
MTTF_cluster   Cluster MTTF
MTTR_cluster   MTTR for a cluster failure
Av             System availability

Table 7.5: Important System Parameters

The distribution factor (DF) is the average number of clusters over which
the second copy of the data stored on one cluster is spread. For example, the DF of the
data placement scheme shown in Figure 7.2 is 2, and the DF of the data placement
scheme shown in Figure 7.3 is 1. I/O access time is the time to transfer one page from
a disk to memory.


Parameter   Values/Range                                      Default
----------  ------------------------------------------------  --------------
MTTF_disk   10K or 30K (hours)                                30K (hours)
MTTF_proc   100K (hours)                                      -
MTTF_mem    100K (hours)                                      -
MTTF_int    100K (hours)                                      -
MTTF_ps     100K (hours)                                      -
NC          1024 to 2                                         -
ND          1 to 512                                          -
NP          1, 2 or ND/2                                      ND/2
DBS         10**11 (100 Giga) or 10**12 (1 Tera) byte         10**12 (byte)
DF          1, NC/2 or NC-1                                   NC/2
C           100M, 500M, or 1000M (bits)                       500M (bits)
PS          4K, 16K, or 64K (bytes)                           16K (bytes)
I           20 ms                                             -

Table 7.6: Input Parameter Values

Depending on the component fault tolerance techniques used in a system, we get
different system configurations. We identify three system configurations of interest to
our study. These configurations are given in Table 7.7.


Configuration 1   Single disk, processor pair, single memory, single interconnect,
                  single power supply
Configuration 2   Mirrored disk, processor pair, single memory, single interconnect,
                  single power supply
Configuration 3   Mirrored disk, processor pair, dual memory, dual interconnect,
                  dual power supply

Table 7.7: System Configuration


To simplify the presentation without reducing the scope of evaluation, we will
assume that all the clusters are identical and all the components of the same type have
the same characteristics. We also assume that the total number of disk units in the
system will be 1024 (i.e., NC * ND = 1024). If each disk unit uses mirroring, there will be
twice as many disks in the system. We will assume the data storage scheme described
in section 7.3.3.3. Because this scheme uses data duplication, the total amount of
storage capacity required in the system will be twice the DBS. Thus if the DBS is
100 gigabytes, then each disk unit should have a storage capacity of (100 * 2 / 1024)
gigabytes, or nearly 200 megabytes. If the DBS is 1 terabyte, then each disk unit should
have a storage capacity of nearly 2 gigabytes.
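
The per-disk-unit capacity figures above follow from simple arithmetic, sketched here. The function name and its keyword parameters are introduced for illustration; decimal units (10**9 bytes per gigabyte) are an assumption, consistent with the DBS values listed in Table 7.6.

```python
def capacity_per_disk_unit(dbs_bytes, total_disk_units=1024, copies=2):
    """Bytes each disk unit must hold when `copies` copies of the database
    are spread evenly over `total_disk_units` disk units."""
    return copies * dbs_bytes / total_disk_units


print(capacity_per_disk_unit(10**11) / 10**6)   # megabytes for a 100-gigabyte database
print(capacity_per_disk_unit(10**12) / 10**9)   # gigabytes for a 1-terabyte database
```

The results are about 195 megabytes and about 1.95 gigabytes, which the text rounds to 200 megabytes and 2 gigabytes.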

In this study we focus only on fault tolerance. Thus we do not consider in
detail the query processing and optimization issues. However, it should be noted that
these issues are very dependent on the data storage scheme used. Fault tolerance is also
very dependent on the data storage scheme used. But a data storage scheme that is
good for fault tolerance may not be very good for query processing. This presents a
very important trade-off that we do not address in this study.

7.5.3.  Mean  Time  to  Failure  of  a  Cluster 

A  cluster  is  active  if  all  of  the  following  hold  true: 

1.  All  of  the  ND  disk  units  are  active, 

2.  At  least  one  of  the  NP  processor  units  is  active, 

3.  Shared  memory  unit  is  active, 

4.  Interconnect  and  connection  to  it  is  active,  and 

5.  Power  supply  unit  is  active. 

Thus, by using equation 7.4, MTTF_cluster is given by:

$$ MTTF_{cluster} = \frac{1}{\dfrac{1}{MTTF_{disks}} + \dfrac{1}{MTTF_{procs}} + \dfrac{1}{MTTF_{mu}} + \dfrac{1}{MTTF_{int}} + \dfrac{1}{MTTF_{psu}}} \tag{7.8} $$

where MTTF_disks is the MTTF of the system of all the disk units in the cluster,
MTTF_procs is the MTTF of the processor units in the cluster, MTTF_mu is the MTTF of
the shared memory unit in the cluster, MTTF_int is the MTTF of the interconnect and the
connection to it, and MTTF_psu is the MTTF of the power supply unit in the cluster. Each of
these items is discussed below.

• All of the disk units of a cluster must be active in an active cluster. This is
because of the way we distribute the data, namely, the data allocated to a cluster is
vertically fragmented and different fragments are stored on different disks. Hence
all the fragments are required to construct a copy of a relation. MTTF_disks can be
given by equation 7.5 because it is a system consisting of ND identical disk units
working in parallel, each of which must be available for the system to
be available. Thus:

$$ MTTF_{disks} = \frac{MTTF_{du}}{ND} $$

where MTTF_du is the MTTF of a disk unit. Each disk unit may consist of a
single disk or mirrored disks. Section 7.5.1 discusses how to calculate MTTF_du for
mirrored disks.

• At least one of the processor units must be active in an active cluster. When a
processor fails, transaction fault tolerance aborts the transactions being executed by
the failed processor. As long as at least one of the processor units is active, the
processing will continue in the cluster, but at reduced speed. Such a property of
the system is called graceful degradation. A processor unit is either a single
processor or a processor pair. Section 7.5.1 discusses how to calculate MTTF_procs when a
cluster contains several processor units.

• The shared memory must be active in an active cluster. The shared memory could
be either nonredundant or redundant. If it uses memory mirroring, MTTF_mu can
be calculated in a way similar to the mirrored disks.

• An interconnect and the cluster's communication to it must be active for an active
cluster. The system may have a single interconnect or dual interconnects (i.e., two
interconnects with independent communication controllers at each cluster).
MTTF_int of a dual interconnect can be calculated as in the case of mirrored
memory.

• The power supply must be active for the cluster to be active. A cluster may have
a single or a dual power supply. MTTF_psu can be calculated as in the case of dual
interconnect and mirrored memory.
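
Putting the five terms together, the cluster MTTF can be sketched as below. This is an illustration under stated assumptions rather than the report's evaluation code: every component is assumed to have a 24-hour MTTR (the value used in the examples of section 7.5.1), every processor unit is assumed to be a processor pair, and the function names, parameter names, and default values (taken from Tables 7.3 and 7.6) are choices made here.

```python
def series_mttf(mttfs):
    """All components needed: failure rates add."""
    return 1.0 / sum(1.0 / m for m in mttfs)


def redundant_mttf(mttf, mttr, k):
    """k identical redundant components, each with its own repairman."""
    lam, mu = 1.0 / mttf, 1.0 / mttr
    u = (lam / (lam + mu)) ** k            # probability that all k are down
    return (mttr / k) * (1.0 - u) / u      # equals mttf when k == 1


def cluster_mttf(nd, n_proc_units, mirrored_disks=False, dual_others=False,
                 mttf_disk=30_000, mttf_proc=100_000, mttf_mem=100_000,
                 mttf_int=100_000, mttf_ps=100_000, mttr=24):
    """Cluster MTTF in hours as a series system of five subsystems."""
    disk_unit = redundant_mttf(mttf_disk, mttr, 2 if mirrored_disks else 1)
    disks = disk_unit / nd                           # all ND disk units required
    proc_unit = mttf_proc / 2                        # pair: both processors required
    procs = redundant_mttf(proc_unit, mttr, n_proc_units)   # one unit suffices
    k = 2 if dual_others else 1
    memory = redundant_mttf(mttf_mem, mttr, k)
    interconnect = redundant_mttf(mttf_int, mttr, k)
    power = redundant_mttf(mttf_ps, mttr, k)
    return series_mttf([disks, procs, memory, interconnect, power])


for name, mirrored, dual in [("Configuration 1", False, False),
                             ("Configuration 2", True, False),
                             ("Configuration 3", True, True)]:
    hours = cluster_mttf(nd=32, n_proc_units=16,
                         mirrored_disks=mirrored, dual_others=dual)
    print(name, round(hours), "hours")
```

Under these assumptions the disk term dominates when the disks are not mirrored, and once the disks are mirrored the single memory, interconnect, and power supply become the limiting terms.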


7.5.3.1. MTTF Results

There are many parameters that influence MTTF_cluster. Among the important
ones are the size of the cluster (primarily identified by ND), the fault tolerance techniques
used for each of its components (i.e., the configuration), the reliability (or mean time to
failure) of each of its components, and the data storage scheme. As noted before, we
quantitatively evaluate only one data storage scheme, which we discussed in section
7.3.3.3.

The larger the size of a cluster, the more parallelism and storage capacity within a
cluster. However, there are more components in a cluster that can fail. Thus
MTTF_cluster will decrease. Two important parameters that decide cluster size are ND
and NP. In terms of fault tolerance and our study, ND plays a significantly more
dominant role for the following reasons:

• MTTF_disk is smaller than MTTF_processor because a disk has mechanical components
and hence is inherently less reliable.

• Usually ND is larger than NP since memory contention limits the number of
processors that can be used in parallel.

• ND also decides the size of the database.

Because of the importance of ND and the sensitivity of all three output
parameters with respect to it, we plot it on the x-axis for all the graphs. The output
parameters are plotted on the y-axis.

The fault tolerance techniques require redundancy and hence more components.
The system configurations reflect the hardware fault tolerance techniques used in the
system. Figure 7.7 compares the MTTF_cluster of clusters using different configurations.
Configuration 3, which uses hardware fault tolerance techniques for all of its
components, performs significantly better.

Because the MTTF of a disk is significantly smaller than the MTTF of other
components, MTTF_disk has a very significant effect on MTTF_cluster. This effect is shown in
Figure 7.8. Note that the scale on the y-axis is logarithmic (i.e., it doubles every unit).

Figure 7.9 compares the MTTF_cluster of a cluster that has one processor with that of a
cluster that has more than one processor. The difference between the curves for NP = 1 and
NP ≥ 2 is significant for smaller values of ND. At larger values of ND, MTTF_disks
becomes a bottleneck for both cases, so the difference is not significant. Also, having
more than two processors does not increase MTTF_cluster noticeably because the
value of 1 / MTTF_procs does not contribute significantly in equation 7.8.

7.5.4.  Response  Time  of  a  Cluster  Recovery 

A cluster recovery takes place after a cluster fails. In section 7.3.3.4, we discussed
a procedure to perform such a recovery. Estimating MTTR_cluster, the response time for
cluster recovery, is quite difficult. As explained below, we will calculate an optimistic
estimate.

The step of the recovery procedure that may contribute the most to the recovery
time  is  step  5,  "send  the  data  to  the  destination  cluster."  This  step  can  be  divided  into 
three  substeps. 

Step  5a.  Read  the  data  from  the  source  clusters.  The  DF  is  the  same  as  the  number  of 
source  clusters;  so  the  recovery  data  will  be  read  from  the  disks  of  each  of  the 
source clusters. Because we assume that vertical fragments of the parts of
relations are stored on a cluster, it is possible to retrieve data from all the disks in
a  cluster  in  parallel.  The  following  equations  give  the  time  to  read  the 
recovery  data. 


total data lost due to cluster failure, DBC = 2 x DBS / NC

total recovery data on a source cluster, DBR = DBC / DF

data on a disk in source cluster, DBD = DBR / ND

number of pages to be read from a disk, PG = DBD / PS

time to read PG pages or the total read time, RT = PG x I

Step 5b. Transmit data from source clusters to destination cluster. We assume that the
data received in one page fetch cycle is assembled into a single packet and
transmitted to a destination cluster. Transmission time includes the time required
for physical channel transmission and communication software overhead.
Optimistically, we assume a communication software overhead factor (CSF) of 2.0.
The time to transmit recovery data can be calculated as follows.

packet size, PKS = ND x PS

time to transmit a packet, TP = (PKS / C) x CSF

total transmission time, TT = TP x PG

Step  5c.  Write  data  on  the  destination  cluster.  The  time  to  perform  this  step  can  be 
calculated  as  in  step  5a. 

We assume that as far as possible, the system will read, transmit and write data in
parallel. Thus, if I < TP (i.e., the time to read a page is less than the time to transmit
a packet), the total time for step 5 is (I + TT + I), where the first I is due to the time
to read a page and the second I is due to the time to write the data received in a packet
in parallel on all the disks in a destination cluster. On the other hand, if I > TP, then
the total time for step 5 is (TR + TP + I), where TR is the total read time (RT in
step 5a).

Table 7.8 summarizes the time taken by various steps of the recovery procedure.
The time taken for other steps of the recovery procedure is our best guesstimate.

Step of Recovery Procedure                Time Taken

1.       Detect failure                   2 sec
2,3,4.   Prepare to send data             10 sec
5.       Send data                        max(2I + TT, TR + TP + I)
6,7.     Update dictionary and restart    2 sec

Table 7.8: Factors Contributing to MTTR
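
The recovery-time estimate can be sketched in a few lines of Python. The sketch is illustrative only. The function and parameter names are ours, and it assumes DBC = 2 x DBS / NC, DBR = DBC / DF, DBD = DBR / ND and PG = DBD / PS for step 5a. Rounding PG up to a whole page is an added assumption, and the fixed times of Table 7.8 are treated as constants.

```python
import math


def step5_time(dbs, nc, df, nd, ps, io_time, c, csf=2.0):
    """Estimate the time of recovery step 5 ("send data"), in seconds.

    dbs: database size (bytes), nc: number of clusters,
    df: distribution factor (number of source clusters),
    nd: disks per cluster, ps: page size (bytes),
    io_time: time I to read or write one page (seconds),
    c: interconnect capacity (bytes per second),
    csf: communication software overhead factor.
    """
    dbc = 2 * dbs / nc        # data lost when one cluster fails
    dbr = dbc / df            # recovery data held by each source cluster
    dbd = dbr / nd            # recovery data on each disk of a source cluster
    pg = math.ceil(dbd / ps)  # pages read from each disk (rounding is our assumption)
    rt = pg * io_time         # total read time (written TR in Table 7.8)
    pks = nd * ps             # one packet carries one page from every disk
    tp = pks / c * csf        # time to transmit one packet
    tt = tp * pg              # total transmission time
    # Communication bound: 2I + TT.  Node bound: TR + TP + I.
    return max(2 * io_time + tt, rt + tp + io_time)


def cluster_mttr(step5_seconds):
    """Add the fixed estimates of Table 7.8: 2 s detect, 10 s prepare, 2 s restart."""
    return 2.0 + 10.0 + step5_seconds + 2.0
```

Taking the maximum of the two expressions reproduces the case analysis in the text: whenever I < TP the first term is the larger one, and whenever I > TP the second is.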

7.5.4.1. MTTR Results

As in the case of MTTF_cluster, MTTR_cluster is influenced by many parameters. Among
the important parameters are the size of the database, distribution factor, interconnect
capacity and page size.

The larger the size of the database, the more data will become unavailable when a
cluster fails. Thus more data will need to be transferred during the recovery procedure,
especially if the system has fewer large clusters. Figure 7.10 shows that MTTR_cluster is
significantly larger for a database size of 1 terabyte than for a database size of
100 gigabytes. Most of the MTTR_cluster is contributed by the transmission time (TT).

A higher DF means more clusters have the recovery data and hence less recovery
data per cluster and disk in a source cluster. Thus it is possible to have more
parallelism for systems with a higher DF. DF = 1 means that only one cluster has all the
second copy of data required for recovery, DF = NC / 2 means on average half of the
clusters in the system have part of the recovery data, and DF = NC - 1 means all the
active clusters have some part of the recovery data. Figure 7.11 shows that the effect of
the DF is very significant. Since the DF depends on the data storage scheme, the effect of
the data storage scheme on MTTR_cluster is very significant.

We found that TP > I in most cases. Thus the transmission time is more
dominant than read time, especially for large clusters (i.e., higher ND). Because of this,
the effect of interconnect capacity, C, on MTTR_cluster is significant, particularly for
clusters with ND ≥ 16. See Figure 7.12.

The larger the page size, the more recovery data is read in a page fetch cycle. This
results in smaller I/O read and write time and comparatively larger transmission time
per packet. For smaller clusters, the read time is dominant and since a large page size
results in fewer page fetches, a system with larger PS has a smaller MTTR_cluster. See
Figure 7.13. For ND ≥ 64, transmission time dominates, so the PS does not affect
MTTR_cluster.


7.5.5.  System  Availability 


The mean time to a cluster failure in a system is NC * MTTF_cluster. The mean
time to repair from such a failure is MTTR_cluster. Thus the system availability can be
calculated as follows using Equation (1).

7.5.5.1. Availability Results

To study the effect of various parameters on a system, we evaluate same-size
systems. We do this by fixing the total number of disk units in a system at 1024 (i.e., NC
* ND = 1024). Thus if the system has large clusters (i.e., ND is large), then the system
has fewer such clusters. However, we will need fewer large clusters in a system for a
given database size. Depending on the capacity of each disk, the total database
size can be varied from 100 gigabytes to 1 terabyte. This size, in our opinion, defines a
very large database system.

System availability is affected by all the parameters that affect MTTF_cluster and
MTTR_cluster. The parameters that have a more significant effect are system
configuration, MTTF_disk, NP, DBS, DF, C and PS.

Table 7.9 relates Av to system down time. The desired level of availability depends
on the type of system application. We feel that modern systems will be expected to
give availability of 0.999 or more. It is comforting to note that the fault tolerance
techniques we already know may be able to give such availability even for very large
systems. It may be noted that all the Av curves that follow are drawn to logarithmic scale.
Thus towards the higher end of the scale, even small vertical separation of curves may
mean very significant differences in Av.


Av          Unavailability of the System

0.99        Unavailable for one hour in 4 days
0.999       Unavailable for one hour in 41 days (more than a month)
0.9999      Unavailable for one hour in 416 days (more than a year)
0.99999     Unavailable for one hour in 4,166 days (more than 11 years)
0.999999    Unavailable for one hour in 41,666 days (more than 114 years)

Table 7.9: Availability vs Unavailability
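
The entries of Table 7.9 follow from simple arithmetic: a system with availability Av is down for a fraction (1 - Av) of the time, so one hour of downtime accumulates over 1 / (1 - Av) hours of operation. The Python sketch below reproduces that conversion. It also shows the usual steady-state definition Av = MTTF / (MTTF + MTTR), which we assume is the form of the availability equation used in this section. The function names are ours.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability from mean time to failure and mean time to repair.

    Assumption: the standard definition MTTF / (MTTF + MTTR).
    """
    return mttf_hours / (mttf_hours + mttr_hours)


def days_per_hour_of_downtime(av):
    """Elapsed days over which one hour of unavailability accumulates."""
    hours = 1.0 / (1.0 - av)   # e.g. Av = 0.99 gives one hour down in 100 hours
    return hours / 24.0


if __name__ == "__main__":
    for av in (0.99, 0.999, 0.9999, 0.99999, 0.999999):
        print(av, int(days_per_hour_of_downtime(av)), "days")
```

Because each additional "nine" multiplies the interval between outage hours by ten, small differences near the top of a logarithmic Av scale correspond to large differences in down time.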


Figure 7.14 compares the Av of different configurations. This shows that the effect
of system configuration (i.e., the hardware fault tolerance techniques) is very significant.
For example, a 64-cluster system with 8 disk units each will be unavailable for one hour
in approximately 79 days if it uses configuration 1, unavailable for 1 hour in
approximately 767 days (2.1 years) if configuration 2 is used, and unavailable for 1 hour in
1250K hours (143 years) if configuration 3 is used.

Figure 7.15 compares Av for systems with different MTTF_disk. Since MTTF_disk is
always the bottleneck for configurations 1 and 3 and usually the bottleneck for
configuration 2, the effect of this parameter is very significant. It should, however, be
noted that because both configurations 2 and 3 use disk mirroring, the improved fault
tolerance results in a 100% increase in disk costs.

Figure 7.16 shows the effect of NP on Av. This figure shows the curves for
configuration 2, in which MTTF_processor is the bottleneck for NP = 1. Thus the effect of
processor redundancy is significant in this case. However, in all other cases, because disk
availability is the bottleneck for Av, the effect of processor redundancy is not significant
if NP ≥ 2. By comparing the curves for NP = 2 and NP = ND / 2, we also note that
the effect of more than two processors on fault tolerance is insignificant. Thus, the
primary purpose of having more than two processors in a cluster will be efficiency in query
processing and not fault tolerance. Also note that processor redundancy costs less than
disk redundancy.

Figure 7.17 shows the effect of DBS on Av. The difference between DBS of 100
gigabytes and 1 terabyte is less when the system consists of many small clusters because
of the smaller difference between values of MTTR_cluster. This difference is very
significant for a system consisting of fewer large clusters, for both MTTR_cluster (see
Figure 7.10) and Av.

Figure 7.18 shows the effect of DF on Av. A higher DF means more clusters have
the recovery data and take part in the recovery procedure, thus providing more parallelism.
This effect is more significant for a system consisting of a large number of small
clusters.

Figure 7.19 shows the effect of C on Av. The difference for a system of many
small clusters is insignificant because the time to read recovery data is dominant (i.e.,
the system is node bound). However, the transmission time is dominant (i.e., the system
is communication bound) for a system with few large clusters, so the effect of C is more
significant.

Figure 7.20 shows the effect of PS on Av. The effect is more significant for a
system with many small clusters. In general, Av is higher for higher PS. A higher PS
means fewer pages need to be fetched and larger chunks of data need to be transmitted.
This results in the system becoming communication bound sooner.

In general, we can observe that systems with many small clusters have better
availability and fault tolerance.

7.6. Conclusions


In this chapter, we identified and described basic hardware and software fault
tolerance techniques that are required to achieve a high degree of fault tolerance in a
very large database system. We also developed quantitative methods to evaluate
availability. The quantitative evaluation helps us to understand the relative importance of
various parameters that affect system fault tolerance.

We use system availability as the main parameter to measure fault tolerance. Our
study shows that system availability is very sensitive to the following parameters:

1.  System Architecture. By varying the number of clusters, the number of disks per
cluster, and the number of processors per cluster, we were able to study different
system architectures. For example, one extreme of our parameterized architecture
presents a loosely coupled, non-shared memory architecture (i.e., the case of ND = 1,
NP = 1). We found that better availability is obtained when the system has many
small clusters. In most cases, ND < 8 when availability peaks.

2.  Fault Tolerance Techniques. We studied the effect of various hardware fault
tolerance techniques by varying system configurations that differ in the fault tolerance
techniques used for different components. We found that the effect of fault
tolerance techniques on system availability is very significant. We also found that the
reliability of a disk (MTTF_disk) is usually a bottleneck, and the performance gain due to
disk mirroring is very substantial. Processor redundancy significantly helps if disks
are mirrored (or other disk redundancy methods are used). However, NP = 2 is
sufficient for fault tolerance. A higher NP does not improve availability
significantly. Fault tolerance techniques for other components are useful in
conjunction with fault tolerance techniques for disks and processors.

3.  Database Size. The size of the database has a significant effect on system
availability. If the database is larger, more data will be lost when a hard failure occurs, so
recovery takes longer. This degrades availability. Also, as the database size
increases, the system will require more components of a given capacity for storage
and efficient processing. This can have a very significant effect on system
availability.

4.  Component Reliability and Capacity. Since disk reliability is usually the
bottleneck, we studied availability for disks with two different MTTF_disk values.
Higher MTTF_disk improves availability significantly. System availability will also
improve significantly if disks with higher capacities are used (provided the MTTF_disk of
a high capacity disk is not much lower than that of a low capacity disk). We also
studied the effect of using interconnects with different capacities. A system with a
higher capacity interconnect has significantly better fault tolerance when the
system consists of a few large clusters.

5.  Data Storage and Access Method. We studied only a limited aspect of this issue.
We used one basic data storage method in our study. By varying the distribution
factor, we were able to study how a data storage method can affect recovery time
and hence system availability. By showing the dependence of the quantitative
evaluation on the data storage method and by using a distribution factor, the
study shows that the data storage method greatly affects system availability. We
studied the access method only with respect to page size. A larger page size helps
significantly in a system with many small clusters.

CHAPTER 8


D/KBMS  Architecture  Specification  Methodology 


Phase I of the VLPDF contract involved six investigation studies: (1) alternative
D/KBMS application interface languages, (2) parallel architectures for such languages,
(3) D/KB query processing, (4) transitive closure algorithms, (5) parallel D/KBMS
architectures, and (6) fault tolerance in very large database systems. During Phase II, the
results of these studies were used to develop a methodology for specifying high
performance, highly available data/knowledge base management systems for very large
data/knowledge base environments. This chapter presents this methodology.

The  methodology  is  presented  as  a  set  of  policies  and  steps.  The  policies  constitute 
a  method  of  action  selected  from  among  alternatives  to  guide  and  determine  D/KBMS 
design  decisions.  They  are  actually  a  philosophical  statement  of  intent  to  guide  the 
D/KBMS  designer  in  making  suitable  design  decisions,  rather  than  a  comprehensive 
recipe  for  design. 

Two sets of steps are presented, the first representing a recipe for D/KB
query and update processing, and the second, an overall procedure for D/KBMS
architecture specification. The steps for D/KBMS architecture specification are not intended
to be comprehensive. Where gaps exist, the D/KBMS designer should consult the
policies to determine an appropriate course of action.

The  policies  are  grouped  under  10  categories: 

1.  policies  regarding  overall  D/KBMS  functionality, 

2.  knowledge  representation  policies, 

3.  rule  storage  policies, 

4.  D/KB  query  processing  policies, 

5.  D/KB  update  processing  policies, 

6.  D/KBMS  functional  partitioning  policies, 

7.  LFP  evaluation  policies, 

8.  join  processing  policies, 

9.  D/KBMS  hardware  architectural  policies,  and 

10.  fault  tolerance  policies. 

These categories represent the critical issues in the design of high performance, highly
available D/KBMSs for very large D/KB environments. Several alternatives are
available for addressing each of these issues. These alternatives have a wide ranging
performance impact. For example, previously published performance results indicate 6 to 8
orders of magnitude difference in performance between certain LFP evaluation
strategies. The policies presented in this chapter are intended to guide the D/KBMS
designer in the choice of suitable alternatives for each critical issue listed above.

The  chapter  is  organized  as  follows.  Sections  8.1  through  8.10  present  the  policies. 
Section  8.11  describes  the  steps  in  D/KB  query  and  update  processing.  Section  8.12 
presents  the  overall  procedure  for  D/KBMS  architecture  specification. 

8.1.  Policies  Regarding  Overall  D/KBMS  Functionality 

To  motivate  the  overall  D/KBMS  functions,  let  us  sketch  out  a  typical  user  session 
with  the  D/KBMS.  The  user  first  enters  a  set  of  rules  and  facts.  These  rules  and  facts 
are  stored  in  a  memory  resident  private  environment  called  the  Workspace  D/KB.  The 
Workspace  D/KB  rules  may  refer  to  rules  and  facts  stored  in  a  shared  disk  resident 
repository  called  the  Stored  D/KB.  The  rules  in  the  Stored  D/KB  may  also  refer  to 
rules  and  facts  in  the  Workspace  D/KB.  After  entering  a  set  of  rules  and  facts  into  the 
Workspace  D/KB,  the  user  issues  queries  against  them.  If  he  is  satisfied  that  the  rules 
and  facts  in  the  Workspace  D/KB  are  correct,  he  updates  the  Stored  D/KB  with  these 
rules  and  facts. 

With this background, we list the overall D/KBMS functions. The D/KBMS shall
provide five basic functions: (1) provide knowledge representation and modeling
capabilities, (2) enter rules and facts into the Workspace D/KB, (3) enter queries, (4) execute
queries, and (5) update the Stored D/KB with rules and facts from the Workspace
D/KB.

8.2.  Knowledge  Representation  Policies 

Knowledge representation is a basic capability expected of a D/KBMS. This
section presents policies relating to knowledge representation.

•  The basic level of knowledge representation capability provided by a D/KBMS
shall be Horn clause logic. The reason for this is that logic offers several
advantages:

a.  it provides a uniform formalism for data, rules, views, and integrity
constraints;

b.  it  is  the  basis  for  relational  database  theory; 

c.  it  is  amenable  to  parallel  processing; 

d.  it  is  an  adequate  basis  for  implementing  other  knowledge  representations;  and 

e.  it  has  a  sound  theoretical  foundation,  which  permits  the  abstract  expression  of 
ideas,  independent  of  their  implementation. 

•  The data/knowledge base shall consist of a set of Horn clauses and schemas. See
chapter 4 for the concepts and definitions pertaining to such a data/knowledge
base.

8.3. Rule Storage Policies

The choice of rule storage structures is a critical one since it affects the time taken
to extract the relevant rules from the Stored D/KB during query processing. This
section presents a set of policies relating to rule storage structures.

•  The basic rule storage structures in the Stored D/KB shall consist of three
relations: isystables, isyscolumns, and irulesource. These storage structures are
basically "source form" storage structures, in that they contain a direct representation
of the source form of the rules.

The  first  two  relations,  isystables  and  isyscolumns,  shall  contain  the  names  and 
column  types  of  the  derived  predicates,  respectively.  These  tables  shall  have  the 
following  schema: 

isystables(tablename char, tableid integer)

isyscolumns(tableid integer, colname char, colnumber integer, coltype integer)

irulesource shall store for each derived predicate p, the rules defining p, and shall
have the following schema:

irulesource(headpredname char, rule char)

•  For update intensive applications, the above source form storage structures shall
suffice. However, for query intensive applications, there shall be a "compiled
form" storage structure (described below), in addition to the basic rule storage
structures. The motivation for this policy is that with compiled form storage
structures, queries can be processed faster, but updates take longer.

•  The compiled form storage structure for query intensive applications shall be a
relation called ireachablepreds. This relation shall be the transitive closure of the
PCG of the rules stored in irulesource. That is, it shall store for each derived
predicate p all the predicates reachable from p. It shall have the following schema:

ireachablepreds(frompredname char, topredname char)

The motivation for storing the transitive closure of the PCG is that using this
storage structure, the time to extract the relevant rules can be made independent
of the total number of rules in the Stored D/KB. If the transitive closure is not
stored, it would have to be computed during query processing and the time for
doing this increases with the number of rules.
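
As an illustration of the compiled form, the following Python sketch computes the contents of ireachablepreds from a parsed rule set and uses them to pick out the derived predicates whose rules are relevant to a query. The (head, body predicates) input format and the function names are assumptions made for the example. The policies above do not prescribe an algorithm or an in-memory representation.

```python
from collections import defaultdict


def reachable_preds(rules):
    """Transitive closure of the PCG.

    rules: iterable of (head, [body predicate names]) pairs, a hypothetical
    parsed stand-in for the rows of irulesource.
    Returns a set of (frompredname, topredname) pairs.
    """
    edges = defaultdict(set)
    for head, body in rules:
        edges[head].update(body)

    closure = set()
    for start in list(edges):
        seen = set()
        stack = list(edges[start])
        while stack:                      # depth-first walk from each head
            pred = stack.pop()
            if pred in seen:
                continue
            seen.add(pred)
            stack.extend(edges.get(pred, ()))
        closure.update((start, pred) for pred in seen)
    return closure


def relevant_heads(query_pred, closure, derived):
    """Derived predicates whose rules are needed to solve query_pred.

    A DBMS would answer this with an indexed lookup on frompredname;
    the scan here only keeps the sketch short.
    """
    reachable = {to for frm, to in closure if frm == query_pred}
    return {p for p in reachable | {query_pred} if p in derived}
```

With the closure stored, finding the relevant rules is a selection on frompredname followed by a lookup of rules by head predicate, so its cost depends on the number of reachable predicates rather than on the total number of stored rules.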

8.4.  D/KB  Query  Processing  Policies 

This section presents policies relating to D/KB query processing. These policies
constitute a broad D/KB query processing strategy; the specific policies relating to high
performance query execution are presented in sections 8.7 through 8.9.

•  D/KB query processing shall consist of two phases: query compilation and query
execution. The motivation for splitting up query processing is twofold. First,
several optimizations can be performed during the compilation phase and second,
frequently occurring queries can be precompiled, speeding up the processing of such
queries.

•  During  compilation,  all  the  rules  needed  to  solve  the  query  shall  be  brought  into 
the  Workspace  D/KB.  In  general,  these  rules  will  be  present  in  both  the 
Workspace  and  Stored  D/KBs.  Bringing  all  of  them  into  the  Workspace  D/KB 
will  typically  involve  first  determining  from  the  existing  Workspace  D/KB  rules  all 
the  predicates  reachable  from  the  query  and  then  extracting  from  the  Stored  D/KB 
the  rules  needed  to  solve  these  predicates. 

•  After  all  the  relevant  rules  are  loaded  into  the  Workspace  D/KB,  the  PCG  of  these 
rules  shall  be  constructed  and  the  cliques  in  the  PCG  identified. 

•  The  cliques  shall  be  ordered  in  such  a  way  that  a  clique  evaluation  begins  only 
after  all  its  body  predicates  have  been  evaluated. 

•  The type of each column of a base predicate is fixed at the time it is created.
The type of the columns of the derived predicates shall be inferred from the rules.
For example, in the rule p(X, Y) :- b(X, Y), the type of the first (respectively,
second) column of p is the same as that of the first (respectively, second) column of
b. Type checking shall infer the types of the derived predicates and also check
whether the same types are inferred from all the rules defining a given derived predicate.

•  The compiled query shall consist of either a standard relational query (when
evaluating a non-recursive D/KB query) or an ordered list of LFP queries (one for
each clique that is to be solved when evaluating a recursive D/KB query). An LFP
query is a query that takes a set of recursive equations of the form
r_i = f_i(r_1, ..., r_n), i = 1, ..., n, as input and computes their least fixed point,
thereby solving each r_i. The f_i's are relational algebra expressions.

•  Database (as opposed to rule base) queries shall be processed using traditional
relational compilation and optimization techniques.

•  A bottom-up strategy (see section 4.4) shall be adopted for rule base query
processing. This is because bottom-up strategies are simple and easy to implement.
Bottom-up evaluation of a non-recursive predicate shall be done using traditional
relational compilation and optimization techniques (see section 4.5). Bottom-up
evaluation of a recursive predicate involves computing the LFP of a set of recursive
equations. Policies for LFP evaluation are presented in section 8.7.
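
Clique identification and ordering can be sketched with Tarjan's strongly connected components algorithm, assuming that a clique corresponds to a strongly connected component of the PCG (a set of mutually recursive predicates). Tarjan's algorithm emits a component only after every component reachable from it, which matches the requirement that a clique be evaluated only after all its body predicates. The dictionary representation of the PCG and the function name are hypothetical, and the recursive formulation is only suitable for small rule sets.

```python
def ordered_cliques(edges):
    """Return the components of the PCG in a valid evaluation order.

    edges: dict mapping each derived predicate to the set of predicates
    that occur in the bodies of its rules (head -> body edges).
    Base predicates and non-recursive predicates appear as
    single-element components.
    """
    index, low = {}, {}
    stack, on_stack = [], set()
    order = []
    counter = [0]

    def visit(pred):
        index[pred] = low[pred] = counter[0]
        counter[0] += 1
        stack.append(pred)
        on_stack.add(pred)
        for body_pred in edges.get(pred, ()):
            if body_pred not in index:
                visit(body_pred)
                low[pred] = min(low[pred], low[body_pred])
            elif body_pred in on_stack:
                low[pred] = min(low[pred], index[body_pred])
        if low[pred] == index[pred]:      # pred is the root of a component
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == pred:
                    break
            order.append(component)

    for pred in list(edges):
        if pred not in index:
            visit(pred)
    return order
```

Each multi-predicate component, and each single predicate that refers to itself, would give rise to one LFP query in the compiled query; the remaining derived predicates can be evaluated as ordinary relational queries.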

8.5.  D/KB  Update  Processing  Policies 

This section presents policies that relate to updating the Stored D/KB with rules
and facts from the Workspace D/KB.

•  During updates, the D/KBMS shall ensure that after the update, ireachablepreds is
the transitive closure of the PCG of the rules in irulesource.

•  During updates, the D/KBMS shall ensure that the Workspace D/KB rules do not
cause the type of any Stored D/KB derived predicate to change. If an update would
cause such a change, the update shall be rejected.

8.6.  D/KBMS  Functional  Partitioning  Policies 

This set of policies relates to the functional components of a D/KBMS and their
interfaces.

•  The  D/KBMS  shall  be  partitioned  into  two  layers,  the  Knowledge  Manager  (KM) 
and  the  Data  Base  Management  System  (DBMS),  with  the  KM  at  the  top  and  the 
DBMS  at  the  bottom.  The  KM  provides  the  interface  to  the  D/KBMS  from  the 
outside  world. 

The basis for this partitioning is the division of D/KB query processing into
compilation and execution phases. The KM and the DBMS correspond respectively to
these phases: the KM is the D/KB query compiler, while the DBMS is the compiled
query execution engine.

The  KM  shall  be  responsible  for  query  parsing,  relevant  rule  extraction,  magic  set 
optimization,  clique  identification  and  ordering,  and  semantic  checking.  The 
DBMS  shall  be  responsible  for  evaluating  the  cliques  and  non-recursive  predicates 
as  per  the  order  prescribed  by  the  KM. 

•  The KM/DBMS interface shall be relational algebra augmented with a general LFP
operator and one or more specialized LFP operators. This interface determines the
allocation of functions above and below the KM/DBMS layer boundary, and
therefore is a key D/KBMS design issue.

The motivation for including relational algebra in this interface is the following.
First, there is a close match between logic and relational algebra and this makes
relational algebra an attractive starting point for Horn clause query processing
support. Second, the set oriented nature of relational algebra makes powerful,
nonprocedural query languages possible. Third, operations on sets and relations are
inherently parallel.

The reason for augmenting relational algebra with a general LFP operator is that it
is not possible to express LFP queries (or recursive queries) using relational algebra
alone. Such queries arise when a clique of mutually recursive predicates is to be
evaluated. Since relational algebra cannot express LFP queries, cliques must be
evaluated via an application program generated by the Knowledge Manager when
using relational algebra as the KM/DBMS interface. There is not much scope for
the KM to optimize the performance of this application program since the
information needed for this optimization (join selectivities, intermediate relation sizes, etc.)
is not visible to the KM through the relational algebra interface. On the other
hand, suppose a general LFP operator is included in the KM/DBMS interface,
which accepts a set of recursive equations of the form r_i = f_i(r_1, ..., r_n),
i = 1, ..., n, as input and computes their least fixed point, thereby solving each r_i.
Then the KM need not generate an application program; it can simply generate
LFP queries and the DBMS can determine an efficient way to execute them (several
policies for enhancing the performance of LFP evaluation are outlined in section
8.7).

The  reason  for  including  one  or  more  specialized  LFP  operators  in  the  KM/DBMS 
interface  is  that  it  may  be  possible  to  optimize  the  execution  of  certain  special 
operators  better  than  that  of  a  general  LFP  operator.  Since  special  LFP  queries 
like  transitive  closure  are  expected  to  occur  frequently,  including  such  operators  in 
the  KM/DBMS  interface  and  implementing  them  efficiently  in  the  DBMS  can 
enhance  overall  performance. 
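
As an illustration of what a general LFP operator computes, the following Python sketch iterates a system of equations of the form r_i = f_i(r_1, ..., r_n) from empty relations until nothing changes. It is a minimal in-memory sketch, not the DBMS implementation: the names `lfp` and `ancestor_eq`, the sample `parent` tuples, and the use of naive iteration are all assumptions made for this example, and each f_i is assumed to be monotone.

```python
from typing import Callable, Dict, FrozenSet, Tuple

Relation = FrozenSet[Tuple]
Equations = Dict[str, Callable[[Dict[str, Relation]], Relation]]


def lfp(equations: Equations) -> Dict[str, Relation]:
    """Naive least-fixed-point iteration for r_i = f_i(r_1, ..., r_n).

    Assumes every f_i is monotone, so iteration from empty relations
    converges to the least fixed point.
    """
    current = {name: frozenset() for name in equations}
    while True:
        # Apply every f_i to the relations produced by the previous iteration.
        nxt = {name: f(current) for name, f in equations.items()}
        if nxt == current:
            return current
        current = nxt


# Hypothetical base relation used only for this example.
parent = frozenset({("ann", "bob"), ("bob", "cal"), ("cal", "dee")})


def ancestor_eq(r: Dict[str, Relation]) -> Relation:
    # ancestor = parent UNION project(parent JOIN ancestor)
    joined = {(x, y) for (x, z) in parent for (z2, y) in r["ancestor"] if z == z2}
    return parent | frozenset(joined)


result = lfp({"ancestor": ancestor_eq})
```

With such an operator in the interface, the KM only has to hand over the equations; how the fixed point is reached (naive, semi-naive, or a specialized transitive closure algorithm) is left to the DBMS.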

8.7.  LFP  Evaluation  Policies 

D/KB query execution performance is significantly affected by the efficiency of
the LFP evaluation strategy. This section presents policies relating to LFP evaluation.

• LFP evaluation shall be done using semi-naive evaluation (see algorithm 4 in
chapter 4). The motivation for choosing semi-naive evaluation over naive evaluation
is that it avoids much of the redundant work (i.e., recomputing tuples in an
iteration that were computed in the previous one) of the latter.

• While simple and easy to implement, bottom-up strategies compute many useless
results,  since  they  do  not  use  knowledge  about  the  query  to  restrict  the  search 
space.  To  overcome  this  problem,  the  KM  shall  rewrite  the  relevant  rules  using  the 
generalized  magic  set  optimization  algorithm  (see  section  4.4)  into  an  equivalent  set 
of  rules  whose  bottom-up  evaluation  is  more  efficient.  The  policy  of  combining 
semi-naive  evaluation  with  the  generalized  magic  set  optimization  algorithm 
addresses  the  inefficiency  problem  of  bottom-up  strategies,  while  at  the  same  time 
retaining  their  ease  of  implementation  advantage. 

•  To  enhance  LFP  evaluation  performance,  a  dynamically  adaptable  indexing  stra¬ 
tegy  shall  be  used  to  speed  up  the  evaluation  of  the  right  hand  side  of  the  recursive 
equations  or  their  differential.  This  strategy  shall  dynamically  create  and  drop 
temporary  indexes  on  the  base  and  intermediate  derived  relations  depending  on 
their  relative  sizes. 

•  During  LFP  evaluation,  the  join  strategy  shall  be  dynamically  changed  between 
iterations if necessary, depending on the sizes of the base and intermediate derived
relations  and  the  join  selectivities  from  the  previous  iterations. 

•  Parallel  and  pipelined  processing  techniques  shall  be  employed  during  LFP  evalua¬ 
tion.  These  include  evaluating  the  right  hand  side  of  each  recursive  equation  in 
parallel  and  pipelining  and  data  flow  techniques  for  evaluating  the  relational  alge¬ 
bra  tree  corresponding  to  the  right  hand  side  of  these  equations. 
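
The first two policies above can be sketched together for the ancestor rules used later in this chapter. The Python code below is an illustrative in-memory sketch, not algorithm 4 of chapter 4: the function name `semi_naive_ancestor`, the `seeds` parameter, and the sample data are assumptions. The semi-naive loop joins parent only with the tuples that were new in the previous iteration, and the optional seed set plays the role of the magic predicate by restricting the computation to constants reachable from the query binding.

```python
def semi_naive_ancestor(parent, seeds=None):
    """Semi-naive evaluation of ancestor, optionally restricted by magic seeds."""
    if seeds is not None:
        # Magic set: every constant reachable from the seeds through parent.
        magic, frontier = set(seeds), set(seeds)
        while frontier:
            frontier = {z for (x, z) in parent if x in frontier} - magic
            magic |= frontier
        # Only parent tuples whose first column is in the magic set matter.
        parent = {(x, y) for (x, y) in parent if x in magic}
    total = set(parent)   # all ancestor tuples found so far
    delta = set(parent)   # tuples that were new in the last iteration
    while delta:
        # Join parent with the differential only, not with all of total.
        derived = {(x, y) for (x, z) in parent for (z2, y) in delta if z == z2}
        delta = derived - total
        total |= delta
    return total


# Hypothetical data: two unrelated families.
parent = {("john", "mary"), ("mary", "sue"), ("bob", "tim"), ("tim", "ann")}

full = semi_naive_ancestor(parent)
restricted = semi_naive_ancestor(parent, seeds={"john"})
answers = {y for (x, y) in restricted if x == "john"}
```

On this data the unrestricted evaluation derives tuples for both families, while the seeded evaluation never touches the tuples for bob and tim, which is the saving the magic set rewriting is meant to provide.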

8.8. Join Processing Policies

The efficiency of the join operation has a significant impact on D/KB query execution
performance. This is because during recursive query execution, evaluation of a
clique generally involves numerous join operations. This section presents policies
relating to join processing.

•  The  DBMS  shall  employ  the  parallel  and  pipelined  join  algorithms  described  in 
chapter  6. 

•  The  query  optimizer  shall  choose  an  appropriate  algorithm  for  a  particular  join 
operation  and  system  configuration.  It  shall  carefully  determine  the  number  of 
clusters,  the  number  of  disks,  and  the  number  of  processors  that  will  be  used  in  the 
join. 

•  In  most  cases,  the  query  optimizer  shall  choose  a  hash-based  join  algorithm.  This 
is  because  hash-based  join  algorithms  are  naturally  parallelizable  and  indeed,  the 
performance  evaluation  reported  in  chapter  8  confirms  the  relative  superiority  of 
hash-based  join  algorithms  over  sort-merge  join  algorithms.  However,  in  cases 
where  the  output  is  required  in  sorted  order  or  the  source  relations  are  already 
sorted,  sort-merge  based  join  algorithms  shall  be  preferred. 
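
To make the distinction in the last policy concrete, the following Python sketch joins r(a, b) with s(b, c) in two ways. It is a single-process illustration, not the parallel and pipelined algorithms of chapter 6; the function names, the `num_partitions` parameter, and the sample relations are assumptions. In the hash-based version each partition pair is independent, which is why the approach parallelizes naturally; the sort-merge version produces its output ordered on the join column.

```python
from collections import defaultdict


def hash_join(r, s, num_partitions=4):
    """Partitioned hash join of r(a, b) and s(b, c) on b."""
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for a, b in r:
        r_parts[hash(b) % num_partitions].append((a, b))
    for b, c in s:
        s_parts[hash(b) % num_partitions].append((b, c))
    out = []
    # Each partition pair could be handled by a different processor.
    for p in range(num_partitions):
        table = defaultdict(list)          # build phase on the r partition
        for a, b in r_parts[p]:
            table[b].append(a)
        for b, c in s_parts[p]:            # probe phase with the s partition
            for a in table.get(b, []):
                out.append((a, b, c))
    return out


def sort_merge_join(r, s):
    """Sort-merge join of r(a, b) and s(b, c); output is ordered on b."""
    r_sorted = sorted(r, key=lambda t: t[1])
    s_sorted = sorted(s, key=lambda t: t[0])
    out, i, j = [], 0, 0
    while i < len(r_sorted) and j < len(s_sorted):
        rb, sb = r_sorted[i][1], s_sorted[j][0]
        if rb < sb:
            i += 1
        elif rb > sb:
            j += 1
        else:
            # Find the group of s tuples with this key, then pair it
            # with every r tuple carrying the same key.
            j_end = j
            while j_end < len(s_sorted) and s_sorted[j_end][0] == rb:
                j_end += 1
            while i < len(r_sorted) and r_sorted[i][1] == rb:
                for k in range(j, j_end):
                    out.append((r_sorted[i][0], rb, s_sorted[k][1]))
                i += 1
            j = j_end
    return out


# Hypothetical sample relations.
r = [(1, "x"), (2, "y"), (3, "x")]
s = [("x", 10), ("y", 20), ("z", 30), ("x", 11)]
hj = hash_join(r, s)
smj = sort_merge_join(r, s)
```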

8.9.  D/KBMS  Hardware  Architectural  Policies 

This  section  presents  policies  relating  to  the  D/KBMS  hardware  architecture. 

•  The  KM  and  the  DBMS  shall  be  resident  on  different  pieces  of  hardware,  the  KM 
on  a  general  purpose  workstation  and  the  DBMS  on  a  special  purpose  parallel  rela¬ 
tional  database  machine.  The  DBMS  software  and  the  parallel  database  machine 
shall  constitute  a  data/knowledge  base  server.  The  overall  environment  shall  be  a 
local  area  network  in  which  several  workstations  running  the  KM  send  LFP  queries 
to  this  server. 

•  The  DBMS  shall  be  a  multiprocessor  relational  database  machine  to  get  the 
requisite  levels  of  performance.  The  DBMS  architecture  shall  be  parameterized  in 
the degree of memory sharing, so that tightly coupled, loosely coupled, and intermediate
architectures can be obtained. It shall use a large number (tens to
hundreds, at least) of processors to obtain the necessary performance, a large
amount (hundreds of megabytes to hundreds of gigabytes) of semiconductor
memory, and shall support an aggregate disk capacity of a terabyte or more. It
shall  be  constructed  principally  from  commodity  components  such  as  general  pur¬ 
pose  microprocessors  and  conventional  disk  storage  devices,  since  commodity  com¬ 
ponents  have  superior  price/performance  and  reliability  compared  to  custom  com¬ 
ponents. 

•  The  architecture  shall  consist  of  a  set  of  clusters  linked  by  an  intercluster  bus  or 
ring.  Each  cluster  shall  consist  of  a  set  of  processors,  a  shared  memory  bank 
addressable  by  all  the  processors  in  the  cluster,  and  a  set  of  disk  storage  units  and 
associated  controllers.  The  processors  may  have  local  caches  to  reduce  memory 
contention, but this shall be invisible to the data management software except for
possibly the need to flush the cache occasionally. Transfers between disk and
memory, and between cluster memories over the bus, shall be a page at a time,
where a page is a few kilobytes or more in size. A specific configuration of this
architecture shall be determined by the following parameters: number of clusters,
number of processors per cluster, number of disks per cluster, pages of main
memory available per cluster, and page size in bytes. The ranges of values for
these parameters shall be such that a wide range of performance and D/KB size
requirements is accommodated.

The  effect  of  system  configuration  on  join  performance  is  significant.  A 
configuration  with  a  few  large  clusters  shall  be  preferred  to  one  with  many  small 
clusters. In fact, our investigation showed that a very large single cluster provides
the  best  performance,  since  the  communication  cost  is  eliminated. 

A  system  with  many  clusters  shall  be  preferred  to  one  with  fewer  clusters  of  the 
same  size.  This  variation  increases  the  parallel  processing  power  of  the  system  and 
speeds  up  the  data  transfer  rate  for  some  algorithms  because  the  data  is  distri¬ 
buted  better  in  the  system. 


The  number  of  disks  at  each  cluster  shall  be  increased  if  the  disk  I/O  proves  to  be 
a  bottleneck. 


Except  for  systems  employing  processing  intensive  join  algorithms,  the  number  of 
processors at each cluster shall not be increased as a means of enhancing performance.
This is because the CPU processing speed is not the bottleneck in most


The effect of disk I/O speed and network transfer rates on join algorithm performance
is significant. Therefore, the values of these parameters shall be adjusted
to achieve the required levels of cost/performance. If disk I/O still becomes a
bottleneck, other solutions, such as using a large page size, using a large
disk-memory transfer rate, increasing the number of disks at each cluster, and
increasing the number of clusters, shall be employed.


8.10.  Fault  Tolerance  Policies 


Fault  tolerance  is  a  system’s  ability  to  tolerate  faults  in  its  components.  Fault 
tolerance  techniques  make  a  system  tolerate  faults,  and  in  case  of  failures  that  cannot 
be tolerated, allow the system to degrade gracefully (i.e., allow partial operation). The
parameter  most  often  used  to  measure  the  fault  tolerance  of  a  system  is  its  availability, 
which  is  the  percentage  of  time  a  system  performs  according  to  its  specifications  (i.e.,  is 
available  to  do  useful  work).  This  section  presents  policies  for  ensuring  high  D/KBMS 
availability  in  very  large  D/KB  environments. 

•  The  D/KBMS  shall  employ  the  hardware  fault  tolerance  techniques  described  in 
section  7.3.2  to  tolerate  failures  of  D/KBMS  hardware  components  affecting  availa¬ 
bility.  These  components  are:  disks,  processors,  memory,  interconnect,  and  power 
supply.  Failure  of  a  sixth  component,  the  intra-cluster  bus,  is  very  infrequent  and 
can be considered as manifesting itself as a failure of one of the other five
components. The hardware fault tolerance techniques introduce redundancy at the
component level so that component failures can be tolerated without affecting system
operation.

The  D/KBMS  shall  employ  the  software  fault  tolerance  techniques  described  in  sec¬ 
tion  7.3.3  to  tolerate  failures  affecting  the  major  software  components  of  the 
D/KBMS.  These  components  are:  transaction  management  and  operating  system 
software. 

From a fault tolerance point of view, architectures with small clusters (i.e., fewer
than  8  disks  per  cluster)  and  a  low  degree  of  memory  sharing  shall  be  preferred  to 
architectures  with  large  clusters  and  a  high  degree  of  memory  sharing,  as  the 
former  have  better  fault  tolerance.  However,  exactly  the  opposite  is  true  from  the 
query  processing  point  of  view.  The  choice  depends  on  the  application  require¬ 
ments.  If  fault  tolerance  is  more  important,  a  large  number  of  small  clusters  shall 
be  used,  while  if  query  processing  performance  is  more  important,  a  few  large  clus¬ 
ters  shall  be  used. 

The  D/KBMS  shall  employ  components  with  high  individual  reliability  values  (par¬ 
ticularly  disks)  and  high  capacity  (particularly  disks  and  interconnect)  as  the  Phase 
I  investigation  showed  that  such  components  have  a  significant  positive  effect  on 
D/KBMS  availability. 

The  D/KBMS  shall  employ  data  storage  methods  with  high  replication  and  high 
distribution  factors. 


The D/KBMS shall employ access methods that allow a large amount of parallelism.

Larger page sizes shall be preferred in a D/KBMS with many small clusters.

Query processing and optimization techniques and system fault tolerance are
intimately related. A comparison of the results of our parallel join processing and
fault tolerance investigations presents interesting but difficult trade-offs. The join
processing  investigation  shows  that  algorithms  perform  better  in  a  tightly  coupled 
(single  cluster)  architecture  since  the  communications  cost  is  eliminated.  However, 
the  fault  tolerance  of  such  an  architecture  is  the  worst  since  failure  of  one  com¬ 
ponent  unit,  such  as  the  shared  memory,  can  stop  the  complete  system.  In  another 
example,  consider  the  trade-offs  with  respect  to  the  data  storage  scheme.  The 
query  processing  techniques  and  the  data  storage  schemes  depend  on  the  type  and 
frequency  of  queries  asked  (i.e.,  the  application).  As  we  saw  in  our  investigation, 
the  data  storage  scheme  affects  the  fault  tolerance  significantly.  Thus,  in  designing 
a  real  system,  the  designers  of  query  processing  software  and  fault  tolerance  have 
to  understand  the  application  and  devise  a  storage  scheme  that  is  acceptable  from 
the  query  processing  as  well  as  fault  tolerance  viewpoint.  Required  levels  of 
response  time,  fault  tolerance  and  reliability  are  application  dependent. 


8.11.  Steps  in  D/KB  Query  and  Update  Processing 

We first describe the steps for processing queries to the workspace D/KB, for the
case where the rules in the workspace D/KB do not refer to rules in the stored D/KB
(scenario 1). We then describe how to update the stored D/KB with rules and facts
from the workspace D/KB. Finally, we describe workspace and stored D/KB query processing
(scenario 2) for the case where the rules in the workspace D/KB and the stored
D/KB may refer to each other.

8.11.1.  Scenario  1 

Consider  the  following  rules  and  query. 

R1: ancestor(X, Y) ← parent(X, Y).

R2: ancestor(X, Y) ← parent(X, Z), ancestor(Z, Y).

query(X) ← ancestor("john", X).

1. Add the query rule to the predicate connection graph (PCG). The PCG then is as shown
in figure 8.1.


Figure  8.1.  PCG  with  Query  Rule  Added 

2. Find relevant predicates and rules. The relevant predicates are the predicates
reachable from the query node in the PCG. The relevant predicates can be
obtained by computing the transitive closure of the PCG. For the ancestor query,
the relevant predicates are ancestor and parent. The relevant rules are those
defining derived relevant predicates. For the ancestor query, the relevant rules are
R1 and R2. We also add query to the set of relevant predicates and the query rule
to the set of relevant rules.

3. Generate the adorned versions of the relevant predicates and adorned rules defining
these predicates.

R1: ancestor^bf(X, Y) ← parent(X, Y).

R2: ancestor^bf(X, Y) ← parent(X, Z), ancestor^bf(Z, Y).

query^f(X) ← ancestor^bf("john", X).

4. Generate magic rules and modified rules.

R1: magic_ancestor^bf(Z) ← magic_ancestor^bf(X), parent(X, Z).

R2: magic_ancestor^bf("john").

R3: ancestor^bf(X, Y) ← magic_ancestor^bf(X), parent(X, Y).

R4: ancestor^bf(X, Y) ← magic_ancestor^bf(X), parent(X, Z), ancestor^bf(Z, Y).

query^f(X) ← ancestor^bf("john", X).

5. Split the data/knowledge base into its intensional and extensional components.

R1: magic_ancestor^bf(Z) ← magic_ancestor^bf(X), parent(X, Z).

R2: magic_ancestor^bf(X) ← base_ma^bf(X).

R3: ancestor^bf(X, Y) ← magic_ancestor^bf(X), parent(X, Y).

R4: ancestor^bf(X, Y) ← magic_ancestor^bf(X), parent(X, Z), ancestor^bf(Z, Y).

base_ma^bf("john").

6. Construct the PCG of these rules.

7. Find the cliques of this PCG (see figure 8.2).

Figure 8.2. Cliques for ancestor Query

8. Construct the evaluation graph. This is a directed graph whose nodes are either
derived predicates or cliques. There can be four types of directed edges: (1) P → C
indicates that some predicate in the clique C appears in the body of a rule defining
P, (2) C → P indicates that P appears in the body of a rule defining some predicate
of C, (3) P1 → P2 indicates that P2 appears in the body of a rule defining P1,
and (4) C1 → C2 indicates that some predicate of C2 appears in the body of a rule
defining some predicate of C1.

For the ancestor query, the evaluation graph is shown in figure 8.3. The evaluation
graph is essentially the PCG with the base predicates removed and the
predicates of a clique collapsed into a single node. Thus, the predicate nodes of the
evaluation graph are the nonrecursive predicates of the PCG. Also, it should be
clear that while the PCG may be cyclic, the evaluation graph is acyclic.

Figure 8.3. Evaluation Graph for ancestor Query (nodes: query, clique1, clique2)

9.  Perform  a  topological  sort  of  the  evaluation  graph.  This  gives  a  total  order  indi¬ 
cating  the  order  in  which  the  nonrecursive  predicates  and  cliques  are  to  be 
evaluated.  The  order  is  such  that  a  rule  is  evaluated  only  after  the  predicates  in 
the  body  are  evaluated.  The  total  order  is  called  the  evaluation  order  list. 

For the ancestor query, the total order is C2, C1, query.

10.  Perform  semantic  checks.  In  this  work,  we  consider  two  kinds  of  checks.  The  first 
is  to  check  for  each  derived  relevant  predicate  whether  there  is  a  rule  defining  it. 
The  second  is  a  type  check.  The  type  of  each  column  of  a  base  predicate  is  fixed 
at  the  time  it  is  created.  The  type  of  the  columns  of  the  derived  predicates  is 
inferred from the rules. For example, in the rule p(X, Y) ← b(X, Y), the type of
the  first  (respectively,  second)  column  of  p  is  the  same  as  that  of  the  first  (respec¬ 
tively,  second)  column  of  b.  Type  checking  involves  inferring  the  types  of  the 
derived  predicates  and  also  checking  whether  the  same  types  are  inferred  from  all 
the rules defining p. This is easy to do for nonrecursive predicates. However, for
recursive predicates, we need to iterate until either closure is reached or there is a type
mismatch. We have developed a type checking algorithm, which we describe in
section 8.11.1.1.

If  there  is  an  error  in  either  of  the  two  semantic  checks,  we  do  not  perform  the 
next  step. 

11.  Evaluate  the  cliques  and  nonrecursive  predicates  as  per  the  order  of  step  9. 
Evaluating  a  clique  means  evaluating  a  block  of  mutually  recursive  predicates. 
This  is  done  as  described  in  section  4.6.  Nonrecursive  predicate  evaluation  is  done 
as  described  in  section  4.5.  [] 
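
The graph-oriented steps of this procedure (building the PCG, computing its transitive closure, finding cliques, constructing the evaluation graph, and sorting it topologically) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: rules are reduced to (head, body predicates) pairs, adornments and arguments are dropped, the function names are hypothetical, and the standard library `graphlib` module (Python 3.9 or later) stands in for whatever sort the Knowledge Manager actually uses.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# The rewritten ancestor rules of step 5, reduced to predicate names.
rules = [
    ("magic_ancestor", ["magic_ancestor", "parent"]),
    ("magic_ancestor", ["base_ma"]),
    ("ancestor", ["magic_ancestor", "parent"]),
    ("ancestor", ["magic_ancestor", "parent", "ancestor"]),
    ("query", ["ancestor"]),
]


def build_pcg(rules):
    """PCG: an edge head -> b for every predicate b in a rule body."""
    pcg = {}
    for head, body in rules:
        pcg.setdefault(head, set()).update(body)
        for b in body:
            pcg.setdefault(b, set())
    return pcg


def transitive_closure(graph):
    """Predicates reachable from each predicate (simple iteration)."""
    reach = {n: set(s) for n, s in graph.items()}
    changed = True
    while changed:
        changed = False
        for n in reach:
            extra = set()
            for m in reach[n]:
                extra |= reach[m]
            if not extra <= reach[n]:
                reach[n] |= extra
                changed = True
    return reach


def evaluation_order(rules):
    """Cliques (as frozensets) and nonrecursive predicates in evaluation order."""
    reach = transitive_closure(build_pcg(rules))
    derived = {head for head, _ in rules}
    node_of = {}
    for p in derived:
        if p in reach[p]:
            # Recursive: the clique is the set of mutually reachable predicates.
            node_of[p] = frozenset(
                q for q in derived if q in reach[p] and p in reach[q])
        else:
            node_of[p] = p
    # Evaluation graph: base predicates dropped, cliques collapsed.
    deps = {node: set() for node in node_of.values()}
    for head, body in rules:
        for b in body:
            if b in derived and node_of[b] != node_of[head]:
                deps[node_of[head]].add(node_of[b])
    # Body nodes are listed as predecessors, so they are evaluated first.
    return list(TopologicalSorter(deps).static_order())


order = evaluation_order(rules)
```

For these rules the sketch yields the magic_ancestor clique first, then the ancestor clique, then query, which corresponds to the order C2, C1, query given in step 9.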

8.11.1.1.  Type  Checking  Algorithm 

Type  checking  in  data/knowledge  base  processing  has  two  purposes:  the  first  pur¬ 
pose  is  to  determine  the  types  of  each  of  the  derived  predicates.  The  second  purpose  is 
to  ensure  that  the  type  of  each  respective  column  of  a  predicate  is  the  same  in  every 
occurrence  of  the  predicate,  which  is  similar  to  type  checking  by  a  compiler  of  any 
strongly  typed  language. 

The  type  of  each  column  of  a  base  predicate  is  fixed  at  the  time  it  is  created.  Dur¬ 
ing  the  type  checking,  the  type  of  each  column  of  a  base  predicate  is  available  from  the 
data  dictionary  of  the  database  that  stores  the  relations  corresponding  to  the  base 
predicates.  The  type  of  a  column  of  a  derived  predicate  is  not  known  before  the  type 
checking  is  performed.  It  is  inferred  from  relationships  in  the  rules  using  the  type  of 
columns  of  base  predicates  and  the  type  of  columns  of  the  derived  predicates  which  are 
already  inferred. 

Example: 

R1: p(X, Y) ← b1(X, Y).

Here the type of the first (respectively, second) column of p is the same as that of
the first (respectively, second) column of b1.

R2: p(X, Y) ← b2(X, Z), q(Z, Y).

Here the type of the first column of p is the same as that of the first column of b2 and
the type of the second column of p is the same as the type of the second column of q.
Additionally, the type of the first column of q is the same as the type of the second column
of b2. []

If the rule set contains both rules, the types of the columns of p inferred using both
rules should be the same. Thus, in the above example, if the type of the first column of b1
is not the same as that of the first column of b2, then type checking will report an error
since different types for the first column of p will be inferred using the two rules. Similarly,
if the type of the second column of b1 is not the same as that of the second
column of q, type checking will report an error since different types for the second
column  of  p  will  be  inferred  using  the  two  rules. 
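
The inference described above can be sketched as a small fixed-point computation in Python. This is not the report's type checking algorithm, only an illustration of the idea under stated assumptions: rules are given as (head, head variables, body atoms), all arguments are variables (constants are not handled), types are plain strings, and the names `infer_types` and `TypeMismatch` are hypothetical. The loop repeats until no column type changes, which also covers recursive predicates.

```python
class TypeMismatch(Exception):
    """Raised when two rules imply different types for the same column."""


def infer_types(rules, base_types):
    """rules: list of (head, head_vars, [(pred, vars), ...]).
    base_types: column types of base predicates from the data dictionary."""
    types = {p: list(cols) for p, cols in base_types.items()}
    for head, head_vars, _ in rules:
        types.setdefault(head, [None] * len(head_vars))
    changed = True
    while changed:                      # iterate until closure is reached
        changed = False
        for head, head_vars, body in rules:
            atoms = list(body) + [(head, head_vars)]
            env = {}                    # variable -> type within this rule
            for pred, args in atoms:
                known = types.get(pred)
                if known is None:
                    continue
                for var, t in zip(args, known):
                    if t is None:
                        continue
                    if env.setdefault(var, t) != t:
                        raise TypeMismatch(
                            f"{var} in a rule for {head}: {env[var]} vs {t}")
            # Propagate the variable types to columns that are still unknown.
            for pred, args in atoms:
                cols = types.setdefault(pred, [None] * len(args))
                for i, var in enumerate(args):
                    if cols[i] is None and var in env:
                        cols[i] = env[var]
                        changed = True
    return types


# Hypothetical column types for the base predicates of the example.
base = {"b1": ["char", "integer"], "b2": ["char", "date"]}
rules = [
    ("p", ("X", "Y"), [("b1", ("X", "Y"))]),
    ("p", ("X", "Y"), [("b2", ("X", "Z")), ("q", ("Z", "Y"))]),
]
inferred = infer_types(rules, base)

# The recursive ancestor rules, with parent typed as char(30) in both columns.
anc_rules = [
    ("ancestor", ("X", "Y"), [("parent", ("X", "Y"))]),
    ("ancestor", ("X", "Y"), [("parent", ("X", "Z")), ("ancestor", ("Z", "Y"))]),
]
anc_types = infer_types(anc_rules, {"parent": ["char(30)", "char(30)"]})
```

If the first column of b1 were given a type different from that of b2, the second rule would bind X to two different types and the sketch would raise `TypeMismatch`, mirroring the error case described above.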

Type checking is easy to do for a nonrecursive predicate. However, for recursive
predicates, it may involve several iterations until either a closure is reached or a type
mismatch is found.

We propose four relations as the storage structures for rules. These relations are
called isystables, isyscolumns, irulesource, and ireachablepreds. isystables and isyscolumns
are  the  data  dictionary  of  the  intensional  knowledge  base.  They  contain  the  types  of 
the  columns  of  the  derived  predicates.  These  types  are  inferred  using  the  type  checking 
algorithm  described  in  the  last  section. 

The  rule  storage  structures  have  the  following  schema: 

isystables (tablename char, tableid integer)

isyscolumns (tableid integer, colname char, colnumber integer, coltype integer)

irulesource stores for each derived predicate p, the rules defining p. It has the following
schema:

irulesource (headpredname char, rule char)

ireachablepreds is the transitive closure of the PCG of the rules stored in irulesource.
It stores for each derived predicate p all the predicates reachable from p. It
has the following schema:

ireachablepreds (frompredname char, topredname char)

We  now  illustrate  how  the  ancestor  rules  would  be  stored  using  the  above  storage 
structures.  Assume  that  both  columns  of  the  parent  relation  are  of  type  char(30). 
Then  the  type  checking  algorithm  will  infer  that  both  columns  of  ancestor  are  also  of 
type  char(30).  We  add  tuples  to  the  storage  structures  as  follows. 

isystables (ancestor, t_anc)

isyscolumns (t_anc, ancestor_col_1, 1, char(30))
isyscolumns (t_anc, ancestor_col_2, 2, char(30))

irulesource (ancestor, "ancestor(X, Y) ← parent(X, Y).")

irulesource (ancestor, "ancestor(X, Y) ← parent(X, Z), ancestor(Z, Y).")

ireachablepreds (ancestor, parent)
ireachablepreds (ancestor, ancestor)

The  major  advantage  of  ireachablepreds  as  a  compiled  form  of  the  rules  is  that  it  allows 
very  efficient  retrieval  of  the  relevant  rules  from  the  intensional  knowledge  base.  For 
example,  suppose  the  stored  D/KB  contains  the  following  rules: 


R1: p(X, Y) ← a(X, Z), q(Z, Y).

R4: q(X, Y) ← c(X, Y).

R5: c(X, Y) ← b3(X, Y).

R6: m(X, Y) ← b4(X, Y).

where the b_i's are base predicates. Then retrieving all the rules needed to solve the
query

query(X, Y) ← p(X, Z), m(Z, Y).

is accomplished via the following SQL query:

SELECT irulesource.rule
FROM irulesource, ireachablepreds
WHERE (ireachablepreds.topredname = irulesource.headpredname OR
       ireachablepreds.frompredname = irulesource.headpredname) AND
      (ireachablepreds.frompredname = "p" OR
       ireachablepreds.frompredname = "m")

This query retrieves rules R1 through R6 above. To speed up the execution of this
query,  we  place  a  composite  index  on  the  columns  of  ireachablepreds. 
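
The retrieval can be tried out with the following Python sketch, which uses the standard library `sqlite3` module in place of the RDBMS. Several things here are assumptions made for illustration: the rule set contains only the rules legible above plus one hypothetical rule for a predicate n that is irrelevant to the query, a is treated as a base predicate, the index name `ireach_idx` is invented, SQLite column types replace the char columns, and DISTINCT is added because the join would otherwise return a rule once per matching ireachablepreds tuple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE irulesource (headpredname TEXT, rule TEXT);
CREATE TABLE ireachablepreds (frompredname TEXT, topredname TEXT);
-- Composite index on the columns of ireachablepreds.
CREATE INDEX ireach_idx ON ireachablepreds (frompredname, topredname);
""")

stored_rules = [
    ("p", "p(X, Y) <- a(X, Z), q(Z, Y)."),
    ("q", "q(X, Y) <- c(X, Y)."),
    ("c", "c(X, Y) <- b3(X, Y)."),
    ("m", "m(X, Y) <- b4(X, Y)."),
    ("n", "n(X, Y) <- b5(X, Y)."),   # hypothetical rule, irrelevant to the query
]
# Transitive closure of the PCG of the rules above.
reachable = [
    ("p", "a"), ("p", "q"), ("p", "c"), ("p", "b3"),
    ("q", "c"), ("q", "b3"), ("c", "b3"),
    ("m", "b4"), ("n", "b5"),
]
conn.executemany("INSERT INTO irulesource VALUES (?, ?)", stored_rules)
conn.executemany("INSERT INTO ireachablepreds VALUES (?, ?)", reachable)

relevant = [row[0] for row in conn.execute("""
    SELECT DISTINCT irulesource.rule
    FROM irulesource, ireachablepreds
    WHERE (ireachablepreds.topredname = irulesource.headpredname OR
           ireachablepreds.frompredname = irulesource.headpredname) AND
          (ireachablepreds.frompredname = 'p' OR
           ireachablepreds.frompredname = 'm')
""")]
```

The query returns the rules for p, q, c, and m and skips the rule for n, without ever traversing the rules themselves; all the graph work was done when ireachablepreds was populated.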

We now describe the update algorithm. Let ΔDKB denote the workspace D/KB.

1. Extract from the stored D/KB all the rules needed to evaluate the derived predicates
in ΔDKB. This can be done using a query like the above SQL query. Let
IKB_rel denote the extracted rules.

2. Construct the PCG of the rules in ΔDKB_composite = ΔDKB ∪ IKB_rel, which
denotes the set of rules which are either in ΔDKB or in IKB_rel.

3. Compute the transitive closure of this PCG. This gives all the predicates reachable
from a given predicate in ΔDKB_composite.

4. Perform the two semantic checks described in step 10 of scenario 1.

5. For each derived predicate p in ΔDKB_composite, add tuples to isystables and isyscolumns
if information on p is not present in these tables.

6. For each derived predicate p in ΔDKB_composite, add tuples to ireachablepreds by
looking at the transitive closure computed in step 3.

7. For each rule in ΔDKB, add tuples to irulesource. []

We mentioned before that ireachablepreds is the transitive closure of the PCG of
the rules in irulesource. The above algorithm computes this transitive closure incrementally.
That is, whenever the stored D/KB is to be updated, we do a transitive closure
on only those portions of the stored D/KB that will be affected by the update
(IKB_rel), and not of the entire stored D/KB. This can result in substantial savings in
update times for very large rule sets as the size of IKB_rel will be much smaller than that
of  the  entire  rule  set. 
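
The incremental character of the update algorithm can be illustrated with the following Python sketch. It covers steps 1, 2, 3, 6, and 7 only; the semantic checks and the data dictionary updates of steps 4 and 5 are omitted. The in-memory representation (rules as (head, body predicates) pairs, ireachablepreds as a set of pairs), the function name `update_stored_dkb`, and the sample rules are assumptions made for this example.

```python
def update_stored_dkb(stored_rules, stored_reach, delta_rules):
    """Return the updated rule list, the updated reachability set, and IKB_rel."""
    # Step 1: extract the stored rules needed by the workspace rules (IKB_rel),
    # using the stored transitive closure rather than the rules themselves.
    wanted = {head for head, _ in delta_rules}
    wanted |= {b for _, body in delta_rules for b in body}
    wanted |= {to for (frm, to) in stored_reach if frm in wanted}
    ikb_rel = [rule for rule in stored_rules if rule[0] in wanted]

    # Step 2: PCG of the composite rule set (workspace rules plus IKB_rel).
    pcg = {}
    for head, body in delta_rules + ikb_rel:
        pcg.setdefault(head, set()).update(body)

    # Step 3: transitive closure of this (small) PCG only.
    reach = {p: set(s) for p, s in pcg.items()}
    changed = True
    while changed:
        changed = False
        for p in reach:
            extra = set()
            for q in reach[p]:
                extra |= reach.get(q, set())
            if not extra <= reach[p]:
                reach[p] |= extra
                changed = True

    # Step 6: add the new reachability tuples; step 7: add the new rules.
    new_reach = stored_reach | {(p, q) for p, succ in reach.items() for q in succ}
    return stored_rules + delta_rules, new_reach, ikb_rel


# Hypothetical stored D/KB and workspace rule.
stored_rules = [("q", ["c"]), ("c", ["b3"]), ("m", ["b4"])]
stored_reach = {("q", "c"), ("q", "b3"), ("c", "b3"), ("m", "b4")}
delta_rules = [("p", ["a", "q"])]

rules_after, reach_after, ikb_rel = update_stored_dkb(
    stored_rules, stored_reach, delta_rules)
```

In this example the rule for m is never extracted, so the closure is computed over three rules rather than over the whole stored rule set.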

8.11.3.  Scenario  2 

In  this  section,  we  describe  workspace  and  stored  D/KB  query  processing  for  the 
case  where  the  rules  in  the  workspace  D/KB  and  the  stored  D/KB  may  refer  to  each 
other.  In  this  case,  we  need  to  extract  all  the  relevant  rules  from  the  stored  D/KB.  We 
replace  step  3  of  scenario  1  with  the  following  steps. 

3.1. Compute the transitive closure of ΔDKB.

3.2. From the transitive closure, find the predicates reachable from the query. Let P
denote this set of predicates.

3.3. Extract from the stored D/KB all the rules needed to evaluate the predicates in P.
Let IKB_rel denote the extracted rules.

4. Compute the transitive closure of ΔDKB_composite = ΔDKB ∪ IKB_rel. This gives
the correct set of relevant predicates and rules. []

We  point  out  that  ireachablepreds  makes  it  possible  to  efficiently  extract  the 
relevant  rules  from  the  stored  D/KB.  This  efficiency  will  translate  to  higher  perfor¬ 
mance,  especially  for  very  large  rule  sets. 

8.12.  Steps  in  D/KBMS  Architecture  Specification 

This  section  presents  an  overall  procedure  for  specifying  D/KBMS  architectures. 
The  steps  below  are  not  intended  to  be  comprehensive.  Where  gaps  exist,  the  D/KBMS 
designer  should  consult  the  policies  described  earlier  in  this  chapter  to  determine  an 
appropriate  course  of  action. 

•  Design  a  high  performance  parallel  relational  database  machine  that  employs  the 
parallel  and  pipelined  join  algorithms  described  in  chapter  6. 

•  Implement  the  hardware  and  software  fault  tolerance  techniques  described  in 
chapter  7. 

•  Design  parallel  algorithms  for  general  LFP  evaluation  using  the  above  join  stra¬ 
tegies,  data  flow  and  pipelining  techniques,  and  semi-naive  evaluation. 

•  Design  parallel  algorithms  for  special  LFP  operators  such  as  transitive  closure  using 
the  HYBRIDTC  strategy  outlined  in  chapter  5. 

• Design a Knowledge Manager that compiles Horn clause queries to relational algebra
augmented with a general LFP operator and that uses the generalized magic
sets strategy for restricting the search space to the relevant base relation tuples.
The Knowledge Manager should follow the steps for D/KB query and update processing
described in the previous section.

8.13.  Conclusions 

This  chapter  presented  a  methodology  for  specifying  high  performance,  highly 
available,  large  D/KBMS  architectures.  The  methodology  was  described  as  a  set  of  poli¬ 
cies  and  steps.  The  policies  are  meant  to  serve  as  a  guide  to  the  D/KBMS  designer  in 
making  appropriate  decisions  for  the  following  critical  D/KBMS  design  issues:  overall 
D/KBMS  functionality,  knowledge  representation,  rule  storage,  D/KB  query  processing, 
D/KB  update  processing,  D/KBMS  functional  partitioning,  LFP  evaluation,  join  pro¬ 
cessing,  D/KBMS  hardware  architecture,  and  fault  tolerance.  The  steps  were  presented 
as  a  recipe  for  D/KB  query  and  update  processing  and  for  D/KBMS  architecture 
specification.

CHAPTER 9


VLPDF  Demonstration  Testbed 


This  chapter  describes  a  data/knowledge  base  management  testbed  that  we  have 
designed  and  implemented  on  top  of  a  commercial  relational  database  system.  The 
testbed  is  intended  to  serve  as  both  a  demonstration  and  performance  measurement  and 
evaluation  platform.  As  a  demonstration  platform,  the  testbed  illustrates  the  motiva¬ 
tion  and  basic  functionality  of  a  D/KBMS,  the  components  of  a  D/KBMS  architecture, 
alternative  implementations  of  these  components  and  their  relative  tradeoffs,  and  the 
factors  contributing  to  D/KB  query  compilation  and  execution  time.  As  a  performance 
measurement  and  evaluation  platform,  the  testbed  allows  us  to  make  quantitative  per¬ 
formance  measurements  and  to  study  system  performance  sensitivity  and  behavior  with 
respect  to  several  parameters. 

9.1.  VLPDF  Demonstration  Testbed  Architecture 

The  VLPDF  demonstration  testbed  is  built  on  top  of  an  existing  testbed  —  the 
Informix  Relational  Database  Management  System  (RDBMS)  —  and  is  implemented  in  a 
Unix  4.2  BSD  environment  on  Apollo  workstations.  The  overall  configuration  of  the 
testbed  is  shown  in  Figure  9.1.  The  testbed  consists  of  four  components:  User  Interface, 
Knowledge  Manager,  Informix  Relational  DBMS,  and  C  Compiler. 

The  Knowledge  Manager  together  with  the  Informix  RDBMS  constitutes  the 
D/KBMS.  The  User  Interface  manages  the  interaction  between  users  and  the  D/KBMS. 
The users can be either humans or application programs (we also view expert systems as
application  programs). 

The Knowledge Manager is essentially a compiler. It accepts Horn clauses and
queries from the User Interface and compiles queries into C code fragments. The C
code fragment is then compiled by the C compiler and linked with a run-time library to
produce the object code, which is executed by the User Interface as an application
program against the Informix RDBMS to give the query results. The C code fragment
contains information specific to the query, while the run-time library contains the
algorithms for LFP evaluation and miscellaneous utilities.

9.1.1.  User  Interface 

The  User  Interface  provides  the  following  options: 

Quit 

This  option  terminates  the  current  session  with  the  demonstration  testbed. 
having a Horn clause query language embedded within it.

This  option  allows  the  user  to  enter  and  query  single-solution  PARLOG  clauses. 
These  clauses  may  contain  calls  to  all-solutions  relations  in  their  body.  The  Horn 
clauses  defining  these  all-solutions  relations  are  assumed  to  be  in  the  stored  D/KB. 

D/KBMS  Interaction 

This option allows the user to interact with the D/KBMS. The D/KBMS interaction
options constitute the second level menu and are described below.

Set  D/KB 

This  option  allows  the  user  to  specify  the  name  of  the  stored  D/KB 
against  which  to  process  queries  and  updates. 

Enter  Rules 

This  option  allows  the  user  to  enter  a  set  of  Horn  clauses  into  the 
workspace  D/KB,  either  interactively  or  through  a  file. 

Enter  Query 

This option allows the user to enter a query. The entered query is
compiled by the Knowledge Manager. The User Interface prompts the user
for the query and for the name of the object file in which to put the compiled
query. It then puts the entered query into a file and requests the
Knowledge Manager to compile the query.

Execute  Query 

This option allows the user to execute a previously entered query. The
User Interface executes the previously compiled query as an application
program against the Informix RDBMS.

Update  Stored  D/KB 

This  option  allows  the  user  to  update  the  stored  D/KB  with  rules  and 
facts  from  the  workspace  D/KB. 

Quit 

This  option  allows  the  user  to  return  to  the  top-level  menu. 

Initialize  D/KBMS 

This  option  allows  the  user  to  initialize  the  major  data  structures  of  the 
Knowledge  Manager. 

List  Workspace  D/KB  Rules 

This  option  allows  the  user  to  view  the  rules  present  in  the  workspace 
D/KB. 

List  Relevant  Stored  D/KB  Rules 

This  option  allows  the  user  to  view  all  the  rules  present  in  the  stored 
D/KB  required  to  evaluate  a  given  predicate.  The  User  Interface  prompts 
the  user  for  the  predicate  name  and  arity.  It  then  requests  the  Knowledge 
Manager to list the stored rules needed to evaluate this predicate, giving it
the predicate name, the arity, and the stored D/KB name. The Knowledge
Manager  extracts  the  relevant  rules  from  the  storage  structures  and  puts 
them  in  a  file,  which  is  then  displayed  by  the  User  Interface. 

9.1.2.  Knowledge  Manager 

In this section, we describe the architecture of the Knowledge Manager. The
Knowledge Manager consists of the following components: Rule Parser, Stored D/KB
Manager, Workspace D/KB Manager, Semantic Checker, Optimizer, and Code Generator. The
architecture of the Knowledge Manager, indicating the interconnections between these
components, is shown in figure 9.2. The circles in this figure represent data structures
and the boxes represent components.

First, we list the functions provided in the Knowledge Manager interface and the
functions provided in the interfaces of its components. Second, we describe the major
data  structures  of  the  Knowledge  Manager.  Finally,  we  describe  the  processing  done  by 
the  Knowledge  Manager  to  implement  the  functions  in  its  interface. 

9.1.2.1. Knowledge Manager Interfaces

Overall Knowledge Manager Functions

Initialize  D/KBMS 

Enter  Horn  clauses 

Compile  query 

Update  stored  D/KB 

List  workspace  D/KB  rules 

List  relevant  stored  D/KB  rules 


Workspace  D/KB  Manager  Interface 

Compute  PCG  transitive  closure 

Extract  relevant  predicates 
Find  cliques 


Generate  evaluation  order  list 


Figure 9.2. Knowledge Manager Architecture (diagram labels: Rule Parser, Stored D/KB Manager, Workspace D/KB Manager, Semantic Checker, Optimizer, Code Generator, file of rules or queries, predicates, query execution environment, fragment, To DBMS)

Stored  D/KB  Manager  Interface 

Extract  relevant  rules 

Insert  temporary  base  predicates 
Read  data  dictionary 
Update  D/KB 
Optimizer Interface

Generate adorned rule set

Generate magic and modified rules


9.1.2.2. Knowledge Manager Data Structures

The  major  data  structures  in  the  Knowledge  Manager  are:  rules,  predicates,  and 
query  execution  environment. 

Rules 

This is a hash table with each bucket containing the following information:

•  Rule  id. 

•  Name  and  list  of  arguments  of  each  predicate  as  indicated  in  the  source 
form  of  the  rule. 

•  Other  information  about  each  predicate  (see  below). 

Predicates 

This is a hash table with each bucket containing the following information:

•  Predicate  id. 

•  Internal  name. 

•  Type,  i.e.,  base  or  derived. 

•  Arity  (number  of  arguments). 

•  Schema  information,  i.e.,  name  and  type  of  each  column. 

•  List  of  rules  for  which  the  predicate  is  a  head. 

•  List  of  rules  for  which  the  predicate  is  a  body  predicate. 
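
The two hash tables described above can be sketched in Python as follows. This is a minimal illustration only, assuming dictionaries in place of the testbed's hash tables; the names `Predicate`, `Rule`, `get_predicate`, `load_rule`, `head_of`, and `body_of` are hypothetical and do not come from the testbed's C sources.

```python
from dataclasses import dataclass, field


@dataclass
class Predicate:
    pred_id: int
    name: str                                     # internal name
    kind: str                                     # "base" or "derived"
    arity: int                                    # number of arguments
    schema: list = field(default_factory=list)    # (column name, column type) pairs
    head_of: list = field(default_factory=list)   # ids of rules with this head
    body_of: list = field(default_factory=list)   # ids of rules using it in the body


@dataclass
class Rule:
    rule_id: int
    head: tuple   # (predicate name, argument list)
    body: list    # list of (predicate name, argument list)


rules = {}        # rule id -> Rule
predicates = {}   # (name, arity) -> Predicate


def get_predicate(name, arity):
    """Return the entry for a predicate, creating it as a base predicate if new."""
    key = (name, arity)
    if key not in predicates:
        predicates[key] = Predicate(len(predicates) + 1, name, "base", arity)
    return predicates[key]


def load_rule(head, body):
    """Record a rule and cross-link it with its head and body predicates."""
    rule = Rule(len(rules) + 1, head, body)
    rules[rule.rule_id] = rule
    head_pred = get_predicate(head[0], len(head[1]))
    head_pred.kind = "derived"   # assumption: any predicate with a defining rule is derived
    head_pred.head_of.append(rule.rule_id)
    for name, args in body:
        get_predicate(name, len(args)).body_of.append(rule.rule_id)
    return rule


# The two ancestor rules from the demonstration rule base.
load_rule(("anc", ["X", "Y"]), [("parent", ["X", "Y"])])
load_rule(("anc", ["X", "Y"]), [("parent", ["X", "Z"]), ("anc", ["Z", "Y"])])
```

Keeping the head and body rule lists on each predicate lets later steps (relevant predicate extraction, clique finding) walk from a predicate to its defining rules without scanning the whole rules table.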

Query  execution  environment 

The query execution environment data structure contains all the information necessary
to execute a query against the data/knowledge base.

• tcpcg: transitive closure of the PCG of the rules in the workspace D/KB
(the transitive closure graph is represented as a boolean matrix).

• relpreds: list of predicates and rules relevant to the query.

•  cliques:  list  of  cliques  in  the  workspace  D/KB.  Each  clique  contains: 

-  Clique  id. 

-  List  of  exit  rules. 

-  List  of  recursive  rules. 

-  List  of  recursive  predicates. 

•  evalorderlist:  list  of  nodes  of  the  evaluation  order  graph  for  the  query 
in  topologically  sorted  order. 
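
The tcpcg and relpreds fields can be illustrated with a short Python sketch. It assumes that the PCG has an edge from the head predicate of each rule to each of its body predicates, and it uses Warshall's algorithm on a boolean matrix; the function names and the simplified rule representation (head name plus list of body predicate names) are assumptions made for illustration.

```python
def pcg_transitive_closure(rules):
    """rules: list of (head, [body predicate names]).

    Returns (names, tc) where tc[i][j] is True when predicate names[j]
    is reachable from predicate names[i] in the PCG.
    """
    names = sorted({h for h, _ in rules} | {b for _, body in rules for b in body})
    index = {name: i for i, name in enumerate(names)}
    n = len(names)
    tc = [[False] * n for _ in range(n)]
    for head, body in rules:
        for b in body:
            tc[index[head]][index[b]] = True
    # Warshall's algorithm: allow paths through intermediate node k.
    for k in range(n):
        for i in range(n):
            if tc[i][k]:
                for j in range(n):
                    if tc[k][j]:
                        tc[i][j] = True
    return names, tc


def relevant_predicates(query_pred, names, tc):
    """The query predicate followed by every predicate reachable from it."""
    i = names.index(query_pred)
    reachable = [names[j] for j in range(len(names))
                 if tc[i][j] and names[j] != query_pred]
    return [query_pred] + reachable


rules = [
    ("anc", ["parent"]),             # anc(X, Y) <- parent(X, Y).
    ("anc", ["parent", "anc"]),      # anc(X, Y) <- parent(X, Z), anc(Z, Y).
    ("femanc", ["anc", "person"]),   # femanc(X, Y) <- anc(X, Y), person(Y, "female").
]
names, tcpcg = pcg_transitive_closure(rules)
relpreds = relevant_predicates("femanc", names, tcpcg)
```

A predicate is recursive exactly when its own diagonal entry in the closure matrix is set, which is the property the clique-finding step relies on.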


9.1.2.3. Knowledge Manager Processing

In  this  section,  we  describe  the  processing  done  by  the  Knowledge  Manager  to 
implement  the  functions  provided  by  its  interface. 

Initialize  D/KBMS 

The Knowledge Manager sets the rules and predicates hash tables to null.

Enter  Horn  Clauses 

The  Knowledge  Manager  asks  the  Rule  Parser  to  parse  the  Horn  clauses.  The  Rule 
Parser  loads  the  rules  and  predicates  hash  tables. 

Compile  Query 

1. The Knowledge Manager constructs a query execution environment data structure.
It  then  asks  the  Workspace  D/KB  Manager  to  extract  the  predicates  relevant  to 
the  query  from  the  workspace  D/KB.  These  are  the  predicates  reachable  from  the 
query.  The  Workspace  D/KB  Manager  computes  the  transitive  closure  of  the  PCG 
of  the  workspace  D/KB  rules  to  determine  the  reachable  predicates.  It  fills  the 
tcpcg  and  relpreds  fields  of  the  query  execution  environment. 

2. The Knowledge Manager asks the Stored D/KB Manager to extract from the stored
D/KB  all  the  rules  needed  to  evaluate  the  relevant  predicates  found  in  the  previous 
step.  The  Stored  D/KB  Manager  extracts  the  relevant  rules  and  calls  the  Rule 
Parser  to  load  them  into  the  rules  and  predicates  hash  tables.  The  extracted  rules 
then  become  part  of  the  workspace  D/KB. 

3. The Knowledge Manager asks the Workspace D/KB Manager to extract the
relevant  predicates  from  the  workspace  D/KB. 

4. The Knowledge Manager asks the Optimizer to rewrite the workspace D/KB rules
into a more efficiently executable form. First, the Optimizer generates adorned
versions of the relevant predicates and adorned rules defining these predicates,
including the adorned version of the query (the result of evaluating the adorned
query against the adorned rule set is the same as the result of evaluating the
original query against the original rule set). Second, it calls the Rule Parser to load
the adorned rules and predicates into the rules and predicates hash tables. Third,
it generates the magic and modified rules. Fourth, it calls the Rule Parser to load
them into the rules and predicates hash tables. Finally, the Optimizer asks the
Stored D/KB Manager to insert the relevant base predicates present in the
workspace D/KB into the stored D/KB.

5. The Knowledge Manager sets the tcpcg and relpreds fields of the query execution
environment  to  null.  It  then  asks  the  Workspace  D/KB  Manager  to  extract  the 
predicates  relevant  to  the  adorned  query  from  the  workspace  D/KB.  These  are  the 
predicates  reachable  from  the  adorned  query.  The  Workspace  D/KB  Manager  fills 
the  tcpcg  and  relpreds  fields  of  the  query  execution  environment. 

6.  The  Knowledge  Manager  asks  the  Workspace  D/KB  Manager  to  find  the  cliques  in 
the  workspace  D/KB.  The  Workspace  D/KB  Manager  fills  the  cliques  field  of  the 
query  execution  environment.  It  assigns  an  id  to  each  clique  and  fills  in  the  exit 
rules,  recursive  rules,  and  recursive  predicates  for  it. 

7.  The  Knowledge  Manager  asks  the  Workspace  D/KB  Manager  to  generate  the 
evaluation order list. The evaluation order list is a topological sort of the evaluation
order graph of the adorned query. The Workspace D/KB Manager fills the evalorderlist
field of the query execution environment.

8.  The  Knowledge  Manager  asks  the  Semantic  Checker  to  perform  semantic  checks. 
The  Semantic  Checker  performs  two  types  of  checks.  The  first  is  to  check  for  each 
derived  relevant  predicate  whether  there  is  a  rule  in  the  workspace  D/KB  defining 
it.  The  second  is  a  type  check.  The  Semantic  Checker  asks  the  Stored  D/KB 
Manager  to  read  the  extensional  and  intensional  data  dictionaries  to  get  schema 
information  from  the  stored  D/KB.  See  section  2.8  for  the  details  of  the  type 
check  algorithm.  If  the  Semantic  Checker  reports  no  errors,  the  Knowledge 
Manager  performs  the  next  step. 

9. The Knowledge Manager asks the Code Generator to generate the code for each
entry in the evalorderlist. The Code Generator generates a C code fragment containing
information specific to the query.
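
Steps 6 and 7 above (finding cliques and generating the evaluation order list) can be sketched in Python. The sketch assumes that a clique is a maximal set of mutually recursive predicates, that exit rules are the rules of a clique whose bodies mention no predicate of that clique, and that base predicates need no evaluation node; all function names and the simplified rule representation are hypothetical.

```python
def reach(rules):
    """Map each predicate to the set of predicates reachable from it in the PCG."""
    preds = {h for h, _ in rules} | {b for _, body in rules for b in body}
    succ = {p: set() for p in preds}
    for head, body in rules:
        succ[head].update(body)
    closure = {}
    for p in preds:
        seen, stack = set(), list(succ[p])
        while stack:
            q = stack.pop()
            if q not in seen:
                seen.add(q)
                stack.extend(succ[q])
        closure[p] = seen
    return closure


def find_cliques(rules):
    """Group mutually recursive predicates and classify their rules (1-based ids)."""
    closure = reach(rules)
    cliques, placed = [], set()
    for p in sorted(closure):
        if p in placed or p not in closure[p]:
            continue   # already assigned to a clique, or not recursive
        members = sorted(q for q in closure[p] if p in closure[q])
        placed.update(members)
        recursive, exit_rules = [], []
        for rid, (head, body) in enumerate(rules, start=1):
            if head in members:
                target = recursive if any(b in members for b in body) else exit_rules
                target.append(rid)
        cliques.append({"id": len(cliques) + 1, "preds": members,
                        "exit": exit_rules, "recursive": recursive})
    return cliques


def evaluation_order(rules, query_pred):
    """Depth-first topological sort: every node appears after the nodes it uses."""
    cliques = find_cliques(rules)
    node_of = {q: "clique%d" % c["id"] for c in cliques for q in c["preds"]}
    derived = {h for h, _ in rules}
    order, done = [], set()

    def visit(p):
        node = node_of.get(p, p)
        if node in done or p not in derived:
            return   # already scheduled, or a base predicate
        done.add(node)
        group = [q for q in derived if node_of.get(q, q) == node]
        for head, body in rules:
            if head in group:
                for b in body:
                    if node_of.get(b, b) != node:
                        visit(b)
        order.append(node)

    visit(query_pred)
    return order


example_rules = [
    ("anc", ["parent"]),
    ("anc", ["parent", "anc"]),
    ("femanc", ["anc", "person"]),
]
cliques = find_cliques(example_rules)
order = evaluation_order(example_rules, "femanc")
```

For the example rules, the recursive predicate anc forms a clique by itself, and that clique is scheduled before the non-recursive predicate femanc that depends on it.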

Update  Stored  D/KB 

1.  The  Knowledge  Manager  constructs  a  query  execution  environment  data  structure. 
It  then  asks  the  Stored  D/KB  Manager  to  extract  from  the  stored  D/KB  all  the 
rules  needed  to  evaluate  the  predicates  in  the  workspace  D/KB.  The  Stored  D/KB 
Manager  extracts  the  relevant  rules  and  calls  the  Rule  Parser  to  load  them  into  the 
rules  and  predicates  hash  tables.  The  extracted  rules  then  become  part  of  the 
workspace  D/KB. 

2. The Knowledge Manager asks the Workspace D/KB Manager to compute the transitive
closure of the PCG of the rules in the workspace D/KB. The Workspace
D/KB Manager fills the tcpcg field of the query execution environment.

3.  The  Knowledge  Manager  sets  the  relpreds  field  of  the  query  execution  environment 
to  the  list  of  all  predicates  in  the  workspace  D/KB. 
4. The Knowledge Manager asks the Workspace D/KB Manager to find the cliques in
the  workspace  D/KB.  The  Workspace  D/KB  Manager  fills  the  cliques  field  of  the 
query  execution  environment. 

5.  The  Knowledge  Manager  asks  the  Workspace  D/KB  Manager  to  generate  the 
evaluation  order  list.  The  Workspace  D/KB  Manager  fills  the  evalorderlist  field  of 
the  query  execution  environment. 

6.  The  Knowledge  Manager  asks  the  Semantic  Checker  to  perform  the  two  semantic 
checks  mentioned  above.  If  the  Semantic  Checker  reports  no  errors,  the 
Knowledge  Manager  performs  the  next  step. 

7.  The  Knowledge  Manager  asks  the  Stored  D/KB  Manager  to  update  the  stored 
D/KB.  The  Stored  D/KB  Manager  updates  the  intensional  data  dictionary  and 
stores  the  compiled  and  source  forms  of  the  workspace  D/KB  rules. 

List  relevant  stored  D/KB  rules 

The  Knowledge  Manager  asks  the  Stored  D/KB  Manager  to  extract  from  the  stored 
D/KB  all  the  rules  needed  to  evaluate  the  indicated  predicate.  It  then  puts  these  rules 
in  a  file  and  asks  the  User  Interface  to  display  it. 

9.1.3.  Compiled  Code  and  Run-time  Library 

The Knowledge Manager compiles data/knowledge base queries into C code fragments.
The C compiler compiles a fragment and links it with the run-time library to
produce an object program, which, when executed against the Informix RDBMS, gives
the query results. In this section, we describe the contents of the C code fragment and
the run-time library.

C  code  fragment 

The C code fragment generated by the Knowledge Manager loads certain
data structures in the object program with information specific to the query. These
data structures contain information similar to the nodes of the evaluation order graph
of the query. Recall that this graph contains two types of nodes: predicates and
cliques. The predicate nodes contain information about evaluating non-recursive predicates,
while clique nodes contain information about evaluating a block of mutually
recursive predicates.

For predicate nodes, the C code fragment loads the predicate name, schema information
(name and type of each column), and the SQL query to evaluate the body of
each rule in which the predicate appears as head (see section 2.5 for what this SQL
query looks like). For clique nodes, the C code fragment loads the same information,
except that it differentiates between exit rules and recursive rules.

Run-time  library 

The run-time library contains routines that interpret the information loaded by the
C code fragment. Non-recursive predicates are evaluated as described in section 2.5,
and recursive predicates as described in section 2.6. The major contents of the run-time
library are the embedded SQL routines for naive and semi-naive evaluation of a system of
recursive equations. We describe these below.

naive_evaluation (clique)
{
    /* Embedded SQL algorithm for naive evaluation of a clique */

    changed = TRUE;

    /* Initialization */
    foreach predicate p in the clique {

        /* Create temporary tables */
        create table delta_p;
        create table p;
        create table temp_p;
        create table exit_p;

        /* Evaluate exit rules */
        insert into p tuples resulting from evaluating the SQL query associated
            with each exit rule for which p is the head;
        copy p into exit_p;
    }

    /* While loop */
    while (changed) {

        changed = FALSE;

        foreach predicate p in the clique {
            copy exit_p into delta_p;

            /* Evaluate right hand side of recursive equation */
            insert into delta_p tuples resulting from evaluating the SQL query
                associated with each recursive rule for which p is the head;
        }

        /* Termination check */
        foreach predicate p in the clique {
            temp_p = delta_p - p;
            if there are tuples in temp_p
                changed = TRUE;
            delete all tuples from temp_p;
        }

        foreach predicate p in the clique {
            drop table p;
            rename table delta_p to p;
            create table delta_p;
        }
    }

    /* Clean up */
    foreach predicate p in the clique {
        drop table delta_p;
        drop table temp_p;
        drop table exit_p;
    }
}
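
As an illustration of naive evaluation, the following Python sketch runs the same loop for the single-predicate clique anc, using the standard-library `sqlite3` module in place of embedded SQL against Informix. The table and column names (`child`, `par`, `x`, `y`) and the function name are assumptions; the sketch also folds the separate temp table into one EXCEPT query.

```python
import sqlite3


def naive_anc(parent_rows):
    """Naive evaluation of anc over parent(child, par) tuples; returns a set of pairs."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE parent (child TEXT, par TEXT)")
    db.executemany("INSERT INTO parent VALUES (?, ?)", parent_rows)
    db.execute("CREATE TABLE exit_anc (x TEXT, y TEXT)")
    db.execute("CREATE TABLE anc (x TEXT, y TEXT)")

    # Exit rule: anc(X, Y) <- parent(X, Y).
    db.execute("INSERT INTO exit_anc SELECT DISTINCT child, par FROM parent")
    db.execute("INSERT INTO anc SELECT x, y FROM exit_anc")

    changed = True
    while changed:
        # Recompute the whole relation: exit tuples plus one application of
        # the recursive rule anc(X, Y) <- parent(X, Z), anc(Z, Y).
        db.execute("CREATE TABLE delta_anc (x TEXT, y TEXT)")
        db.execute("""
            INSERT INTO delta_anc
            SELECT x, y FROM exit_anc
            UNION
            SELECT p.child, a.y FROM parent p JOIN anc a ON p.par = a.x
        """)
        # Termination check: any tuple in delta_anc that is not yet in anc?
        new_rows = db.execute(
            "SELECT x, y FROM delta_anc EXCEPT SELECT x, y FROM anc"
        ).fetchall()
        changed = len(new_rows) > 0
        db.execute("DROP TABLE anc")
        db.execute("ALTER TABLE delta_anc RENAME TO anc")

    result = set(db.execute("SELECT x, y FROM anc").fetchall())
    db.close()
    return result


ancestors = naive_anc([("chris", "pat"), ("pat", "lee"), ("lee", "sam")])
```

Each iteration re-derives every tuple found so far, which is the redundancy that semi-naive evaluation avoids.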


seminaive_evaluation (clique)
{
    /* Embedded SQL algorithm for semi-naive evaluation of a clique */

    changed = TRUE;

    /* Initialization */
    foreach predicate p in the clique {

        /* Create temporary tables */
        create table p;
        create table new_p;
        create table DELTA_p;
        create table delta_p;

        /* Evaluate exit rules */
        insert into delta_p tuples resulting from evaluating the SQL
            query associated with each exit rule for which p is the head;
        copy delta_p to new_p;
    }

    /* While loop */
    while (changed) {

        changed = FALSE;

        foreach predicate p in the clique {
            delete all tuples from DELTA_p;

            /* Evaluate right hand side of recursive equations */
            insert into DELTA_p tuples resulting from evaluating the
                differential of each recursive rule for which p is the head;
        }

        foreach predicate p in the clique {
            copy delta_p to p;
            delete all tuples from delta_p;

            /* Set difference operation for termination check */
            delta_p = DELTA_p - p;
        }

        foreach predicate p in the clique {
            copy delta_p to new_p;
            if there are tuples in delta_p
                changed = TRUE;
        }
    }

    /* Wrap up */
    foreach predicate p in the clique {
        drop table p;
        rename table new_p to p;
        drop table DELTA_p;
        drop table delta_p;
    }
}
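
A corresponding Python sketch of semi-naive evaluation for the anc clique is given below, again using the standard-library `sqlite3` module as a stand-in for embedded SQL. Because SQLite table names are case-insensitive, the table called DELTA_p in the algorithm is named `big_delta_anc` here; that name, the column names, and the omission of the separate new_p table are simplifying assumptions.

```python
import sqlite3


def seminaive_anc(parent_rows):
    """Semi-naive evaluation of anc over parent(child, par) tuples."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE parent (child TEXT, par TEXT)")
    db.executemany("INSERT INTO parent VALUES (?, ?)", parent_rows)
    for table in ("anc", "delta_anc", "big_delta_anc"):
        db.execute("CREATE TABLE %s (x TEXT, y TEXT)" % table)

    # Exit rule: the first set of new tuples comes from parent.
    db.execute("INSERT INTO delta_anc SELECT DISTINCT child, par FROM parent")

    changed = True
    while changed:
        # Differential of anc(X, Y) <- parent(X, Z), anc(Z, Y):
        # join parent only with the tuples that were new in the last round.
        db.execute("DELETE FROM big_delta_anc")
        db.execute("""
            INSERT INTO big_delta_anc
            SELECT DISTINCT p.child, d.y
            FROM parent p JOIN delta_anc d ON p.par = d.x
        """)
        # Accumulate the previous round's new tuples into the result.
        db.execute("INSERT INTO anc SELECT x, y FROM delta_anc")
        db.execute("DELETE FROM delta_anc")
        # Set difference: keep only tuples that have not been seen before.
        db.execute("""
            INSERT INTO delta_anc
            SELECT x, y FROM big_delta_anc
            EXCEPT
            SELECT x, y FROM anc
        """)
        count = db.execute("SELECT COUNT(*) FROM delta_anc").fetchall()[0][0]
        changed = count > 0

    result = set(db.execute("SELECT x, y FROM anc").fetchall())
    db.close()
    return result


ancestors = seminaive_anc([("chris", "pat"), ("pat", "lee"), ("lee", "sam")])
```

The join in each round touches only the tuples derived in the previous round, so no tuple is derived from the same premises twice; the loop stops as soon as a round produces nothing new.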
CHAPTER 10


Demonstration  Plan 


This  chapter  describes  the  VLPDF  demonstration  plan.  The  demonstration  will 
consist  of  three  experiments  designed  to  demonstrate  the  motivation  and  functionality 
of  a  D/KBMS  and  the  components  of  a  D/KBMS  architecture.  The  next  chapter 
describes  several  tests  designed  to  study  the  relative  tradeoffs  in  D/KBMS  design  and 
the  factors  contributing  to  D/KB  query  compilation,  execution,  and  update  times. 

The chapter is organized as follows. Sections 10.1 and 10.2 describe the demonstration
data base and rule base, respectively. Sections 10.3 through 10.5 describe the
demonstration experiments. For each experiment, we describe the objective, the
background, and the experimentation procedure.

10.1.  Demonstration  Data  Base 

The demonstration database will consist of the following relations:

parent(childname char(20), parentname char(20)).

person(name char(20), sex char(20)).

parts(partid integer, name char(20), units char(10), weight integer).

supplier(name char(20), address char(50), supplierid integer).

b1(col1 integer, col2 integer).

b2(col1 integer, col2 integer).

b3(col1 integer, col2 integer).

b4(col1 integer, col2 integer).

b5(col1 integer, col2 integer).

10.2.  Demonstration  Rule  Base 

The demonstration rule base will consist of 1000 rules, 11 of which are listed below;
the rest are randomly generated.

Ancestor  rules 

R1: anc(X, Y) ← parent(X, Y).

R2: anc(X, Y) ← parent(X, Z), anc(Z, Y).

Non-linear version of ancestor rules

R3: anc_nl(X, Y) ← parent(X, Y).

R4: anc_nl(X, Y) ← anc_nl(X, Z), anc_nl(Z, Y).

Rules with multiple cliques and mutual recursion

R5: p(X, Y) ← p1(X, Z), q(Z, Y).

R6: p(X, Y) ← b1(X, Y).

R7: p1(X, Y) ← b2(X, Z), p1(Z, Y).

R8: p1(X, Y) ← b3(X, Y).

R9: p2(X, Y) ← b4(X, Z), p2(Z, Y).

R10: p2(X, Y) ← b5(X, Y).

R11: q(X, Y) ← p(X, Z), p2(Z, Y).

10.3.  Experiment  1:  D/KBMS  Motivation  and  Basic  Functionality 
Objective 

•  Demonstrate  the  motivation  for  combining  Knowledge  Based  Systems  (KBSs)  and 
DBMS  technologies  to  yield  a  D/KBMS  —  efficiently  managing  access  to  large, 
shared  data/knowledge  bases  and  improving  productivity  and  functionality  of 
information  systems. 

•  Demonstrate  the  basic  functions  of  a  D/KBMS  —  knowledge  representation,  D/KB 
query  language,  D/KB  query  processing,  and  D/KB  updates. 

Background 

Both  KBS  and  DBMS  technologies  are  designed  for  access  and  manipulation  of 
information.  Each  has  the  potential  for  increasing  the  productivity  and  ease  of  use  of 
the  other.  KBS  technology  provides  techniques  for  acquiring  and  representing  domain 
knowledge.  It  provides  increased  functionality  for  information  systems  via  knowledge 
directed  reasoning.  It  provides  increased  productivity  for  information  systems  via  its 
clean  separation  between  the  knowledge  base  and  inference  engine.  This  separation 
allows  incremental  incorporation  of  new  knowledge  as  it  becomes  available  without 
changing  the  reasoning  mechanisms.  It  also  allows  certain  information  retrieval  tasks 
requiring  application  program  development  using  a  DBMS  to  be  expressed  as  D/KB 
queries.  This  advantage  becomes  very  apparent  in  information  retrieval  tasks  involving 
recursive  queries.  These  tasks  require  application  program  development  when  using  a 
relational  DBMS,  since  relational  algebra  is  incapable  of  expressing  recursive  queries. 
However, using a data/knowledge base management system they can be expressed
simply as queries, since such a system can process recursive requests. The added
functionality of a D/KBMS results in productivity gains for the end user.

Thus  far,  KBS  technology  has  not  addressed  systems  and  efficiency  issues.  The 
knowledge  bases  in  current  KBS  applications  are  small  (typically,  memory-resident)  and 
difficult  to  share  between  applications.  DBMS  technology,  on  the  other  hand,  offers 
solutions  to  various  systems  issues  such  as  concurrency,  security,  integrity,  reliability, 
and  recovery.  It  offers  solutions  to  various  efficiency  issues  such  as  organizing  large 
amounts  of  data  and  search  and  query  optimizations. 

Thus, KBS and DBMS technologies have much to offer each other: DBMS technology
offers solutions to systems and efficiency issues, and KBS technology offers improved
functionality and productivity. This synergy is the motivation for combining these
technologies and has led to the notion of a D/KBMS, a tool for efficiently managing access to
large, shared data/knowledge bases.

Procedure 

1.  Look  at  the  contents  of  the  shared  data/knowledge  base,  noting  its  size. 

2. Demonstrate the database representation and manipulation facilities of a D/KBMS.
The VLPDF demonstration testbed offers logic as the knowledge representation,
specifically Horn clauses with no complex terms and no negation. This representation
can express relational operators in a way that is simpler than
the de facto relational data manipulation language, SQL.

2a.  Demonstrate  a  selection  operation  —  "find  all  female  persons"  —  by  entering 
and  executing  the  query: 

query(X) ← person(X, "female").

2b.  Demonstrate  a  projection  operation  —  "find  the  name  and  weight  of  every 
part"  —  by  entering  and  executing  the  query: 

query(X, Y) ← parts(NA1, X, NA2, Y).

2c.  Demonstrate  a  join  operation  —  "find  part  name  and  supplier  of  every 
part"  —  by  entering  and  executing  the  query: 

query(X, Y) ← parts(Z, X, NA1, NA2), supplier(Y, NA3, Z).

3. Demonstrate the knowledge representation and modeling facilities of a D/KBMS.
Look  at  some  of  the  rules  in  the  data/knowledge  base  to  get  an  idea  of  the  type  of 
knowledge  that  can  be  represented. 

3a.  Demonstrate  the  D/KB  query  language  by  entering  and  executing  the  query: 

query(X) ← anc("chris", X).

This  query  means  "find  all  ancestors  of  Chris". 

3b.  Enter  and  execute  the  query: 

query(X) ← anc(X, "chris").

This query means "find all persons whose ancestor is Chris". Comparing this query
with  the  previous  one  demonstrates  the  expressive  power  of  the  Horn  clause  based 
query  language. 

4.  List  the  rules  in  the  workspace  to  see  the  definition  of  the  anc  relation. 

5. Look at the application program needed to determine all ancestors of a given person
when using a relational DBMS. Compare the size of this program with the two
rules defining the anc relation to get an idea of the productivity improvement
made possible by a D/KBMS.

6.  Demonstrate  the  ease  with  which  new  knowledge  can  be  added  by  entering  the 
non-linear version of the ancestor rules (rules R3 and R4) into the workspace
D/KB. 

7.  Enter  and  execute  the  query: 

query(X) ← anc_nl("debby", X).

8.  List  the  workspace  D/KB  rules. 

9.  Update  the  stored  D/KB. 
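
For comparison with the Horn clause queries of steps 2a through 2c, the following Python sketch runs plausible SQL equivalents using the standard-library `sqlite3` module. The sample rows are invented for illustration, the column types are simplified, and the SQL shown is an assumed translation rather than the output of the testbed's code generator.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE person (name TEXT, sex TEXT);
    CREATE TABLE parts (partid INTEGER, name TEXT, units TEXT, weight INTEGER);
    CREATE TABLE supplier (name TEXT, address TEXT, supplierid INTEGER);
    INSERT INTO person VALUES ('debby', 'female');
    INSERT INTO person VALUES ('chris', 'male');
    INSERT INTO parts VALUES (1, 'bolt', 'each', 5);
    INSERT INTO parts VALUES (2, 'nut', 'each', 2);
    INSERT INTO supplier VALUES ('acme', '1 Main St', 1);
""")

# 2a. Selection: query(X) <- person(X, "female").
selection = db.execute(
    "SELECT name FROM person WHERE sex = 'female'"
).fetchall()

# 2b. Projection: query(X, Y) <- parts(NA1, X, NA2, Y).
projection = db.execute(
    "SELECT name, weight FROM parts ORDER BY partid"
).fetchall()

# 2c. Join: query(X, Y) <- parts(Z, X, NA1, NA2), supplier(Y, NA3, Z).
# The shared variable Z becomes the join condition.
join = db.execute(
    "SELECT p.name, s.name FROM parts p "
    "JOIN supplier s ON p.partid = s.supplierid"
).fetchall()

db.close()
```

In the Horn clause form, a constant in an argument position expresses selection, the head variables express projection, and a variable repeated across body predicates expresses a join.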

10.4.  Experiment  2:  D/KBMS  Query  and  Update  Processing  Scenario 
Objective 

Demonstrate  the  various  steps  involved  in  processing  queries  and  updates  to  a 
large,  shared  data/knowledge  base. 

Background 

See  chapter  7. 

Procedure 

1.  Enter  the  following  rule  into  the  workspace  D/KB: 

femanc(X, Y) ← anc(X, Y), person(Y, "female").

2.  Enter  the  following  query: 

query(X) ← femanc("eugene", X).

3.  Demonstrate  each  step  for  compiling  queries  by  describing  its  purpose  and  the 
changes  it  causes  in  the  Knowledge  Manager’s  data  structures  —  rules,  predicates, 
and  query  execution  environment. 

4.  Execute  the  above  query.  Demonstrate  the  execution  of  the  C  code  fragment  by 
describing  the  information  loaded  into  the  object  program’s  data  structures. 
Demonstrate  the  execution  of  the  LFP  evaluation  routine  by  describing  the  results 
of  each  step  of  the  semi-naive  evaluation  algorithm. 
5.  Enter  rules  f2g  through  /?15  into  the  workspace  D/KB  and  then  update  the  stored
D/KB.  Demonstrate  each  step  in  update  processing  by  describing  its  purpose  and 
the  changes  it  causes  in  the  Knowledge  Manager’s  data  structures.  Also,  look  at 
the contents of the rule storage structures, i.e., the relations irulesource,
ireachablepreds, isystables, and isyscolumns.

10.5.  Experiment  3:  D/KB  Query  Language  Embedded  in  PARLOG 
Objective 

Demonstrate  the  embedding  of  Horn  clauses  in  PARLOG. 

Background 

PARLOG  is  a  language  with  two  types  of  relations:  single-solution  relations  and 
all-solutions  relations.  All-solutions  relations  can  appear  in  the  body  of  single-solution 
relations.  The  evaluation  semantics  for  these  relations  are  completely  different.  Indeed, 
they  can  be  viewed  as  two  different  languages  —  all-solutions  relations  constituting  a 
query  language  and  single-solution  relations  a  parallel  applications  programming 
language.  An  all-solutions  relation  query  can  be  evaluated  by  computing  the  least  fixed 
point of the Horn clauses defining this relation. Thus, PARLOG can be considered as
having a Horn clause query language embedded within it. This experiment will
demonstrate this embedding.

Testbed  Configuration 

Same  as  for  experiment  1. 

Procedure 

First,  load  the  following  single-solution  program  into  the  PARLOG  environment. 


mode partition(pivot?, list?, less-list^, greater-list^).

partition(u, [v|x1], [v|y1], z) <- v < u : partition(u, x1, y1, z).
partition(u, [v|x1], y, [v|z1]) <- u =< v : partition(u, x1, y, z1).
partition(u, [], [], []).


mode append(list1?, list2?, appended-list^), sort(list?, sorted-list^).


append([hd|tail], list2, [hd|atail]) <- append(tail, list2, atail).
append([], list2, list2).


sort([hd|tail], sorted-list) <-
    partition(hd, tail, list1, list2),
    sort(list1, sorted-list1),
    sort(list2, sorted-list2),
    append(sorted-list1, [hd|sorted-list2], sorted-list).

sort([], []).

Second,  enter  the  following  single-solution  PARLOG  query. 


(z <= Allsol(Anc("eugene", x), Person(x, "female"))) & SET(y, x, z), Sort(y, a) &


This PARLOG query will produce a sorted list of the female ancestors of eugene. Here,
the D/KB query Anc("eugene", x), Person(x, "female") is embedded in the single-solution
PARLOG relation SET. The D/KB query returns a list of the female ancestors and
binds y to this list. The sort call is like executing an application program.

Execute  the  above  PARLOG  query,  demonstrating  the  various  parallel  processes 
created  and  the  interactions  between  them. 
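
For readers unfamiliar with PARLOG, the sort program above corresponds to the following sequential Python sketch. It mirrors only the logic of the partition, append, and sort relations; it does not model PARLOG's parallel processes or guards, and the sample names are invented.

```python
def partition(pivot, items):
    """Split items into those less than the pivot and those not less than it."""
    less = [v for v in items if v < pivot]
    greater = [v for v in items if pivot <= v]
    return less, greater


def sort(items):
    """Quicksort in the same shape as the PARLOG sort relation."""
    if not items:
        return []
    hd, tail = items[0], items[1:]
    less, greater = partition(hd, tail)
    # append(sorted-list1, [hd|sorted-list2], sorted-list)
    return sort(less) + [hd] + sort(greater)


# Hypothetical result of the embedded D/KB query, i.e. the list bound to y.
female_ancestors = ["pat", "ann", "lee"]
sorted_ancestors = sort(female_ancestors)
```

In the PARLOG version, the two recursive sort calls can run as concurrent processes, which is what the demonstration is meant to show.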




CHAPTER  11 


D/KBMS  Performance  Measurement  and  Evaluation 


The demonstration experiments described in the previous chapter demonstrated the
motivation and functionality of a D/KBMS and the components of a D/KBMS architecture.
This chapter describes several experiments we designed to quantitatively measure
D/KBMS performance and to understand D/KBMS performance sensitivity and
behavior with respect to various system parameters. The basic motivation for doing
these experiments is to justify the D/KBMS architecture specification methodology
described in chapter 8, that is, to show that this methodology can indeed be used to
design high performance D/KBMSs.

The  chapter  is  organized  as  follows.  Section  11.1  describes  D/KBMS  performance 
measures.  Section  11.2  describes  parameters  that  affect  these  measures.  Sections  11.3 
and  11.4  characterize  the  data  and  rule  bases  used  during  the  experimentation.  Section 
11.5  contains  a  description  of  the  tests  and  an  analysis  of  the  test  results.  Section  11.6 
describes  the  conclusions  drawn  from  the  experimentation. 

11.1.  D/KBMS  Performance  Measures 

The main D/KBMS performance measures are shown in table 11.1.

t_c    D/KB query compilation time
t_e    D/KB query execution time
t_u    D/KB update time

Table 11.1. D/KBMS performance measures

During  experimentation,  values  of  these  measures  were  obtained  by  averaging  over 
5  readings. 
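
A measurement of this kind can be sketched in Python as follows. The helper name `average_time` and the use of `time.perf_counter` are assumptions made for illustration; the report does not describe the timing mechanism used on the testbed.

```python
import time


def average_time(operation, readings=5):
    """Run operation `readings` times and return the mean elapsed time in seconds."""
    total = 0.0
    for _ in range(readings):
        start = time.perf_counter()
        operation()
        total += time.perf_counter() - start
    return total / readings


# Example: time a stand-in workload; a real run would compile or execute a query.
calls = []
mean_seconds = average_time(lambda: calls.append(sum(range(1000))))
```

Averaging over several readings reduces the influence of transient system load on any single measurement.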

The components comprising D/KB query compilation time, t_c, are shown in table
11.2. Each component corresponds to a step in the compilation procedure. We have not
included the time to execute the magic set optimization algorithm, since we have not
implemented this algorithm in our testbed.
• Time to parse the query.

• Time to construct the query execution environment data structure.

• Time to extract the relevant predicates from the workspace D/KB.

• Time to extract the relevant rules from the stored D/KB.

• Time to extract the relevant predicates from the workspace D/KB after the
relevant rules from the stored D/KB have been extracted.

• Time to read the D/KB data dictionaries.

• Time to find the cliques in the workspace D/KB rules.

• Time to generate the evaluation order list, i.e., to construct the evaluation
order graph and perform a topological sort of this graph.

• Time to perform the semantic checks.

• Time to generate the code fragment.

• Time to compile the code fragment and link it with the run-time library.


The  components  comprising  D/KB  query  execution  time  are  shown  in  table  11.3. 
Each  component  corresponds  to  a  step  in  LFP  evaluation. 


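Component   Description
----------  ------------------------------------------------------------
$t_{e1}$    Time for initialization.
$t_{e2}$    Time for while loop execution.
$t_{e3}$    Time for dropping derived predicate and other temporary
            tables.

Table 11.3. $t_e$ breakup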


The  components  comprising  D/KB  update  time  are  shown  in  table  11.4. 


Component   Description
----------  ------------------------------------------------------------
$t_{u1}$    Time to extract the rules relevant to the Workspace D/KB
            from the Stored D/KB.
$t_{u2}$    Time to update the irulesource relation.
$t_{u3}$    Time to update the ireachablepreds, isystables, and
            isyscolumns relations.

Table 11.4. $t_u$ breakup


11.3.  Data  Base  Characterization 

The base relations used in the experimentation are all binary relations. We
characterize them in terms of their directed graph representation. In this
representation, a binary relation is represented as a directed graph; domain elements
form the nodes of this graph and tuples the edges.

We used the following types of data bases in the experimentation: lists, full binary
trees, directed acyclic graphs, and directed cyclic graphs. The parameters used to
characterize these data bases are shown in table 11.6.


Relation type           Parameters                     Remarks
----------------------  -----------------------------  -------------------------------
List                    Number of lists, their         The number of tuples in a data
                        average length                 base with n lists of average
                                                       length l is approximately
                                                       $n(l - 1)$.

Full binary tree        Number of trees, their depth   The number of tuples in a data
                                                       base with n trees, each of
                                                       depth d, is $n(2^d - 2)$.

Directed acyclic graph  Number of tuples, fan-out,     Fan-out and fan-in respectively
                        fan-in, path length            refer to the number of arcs
                                                       leaving and entering a node in
                                                       the graph. Path length refers
                                                       to the number of nodes in a
                                                       path starting with a node with
                                                       zero fan-in and ending with a
                                                       node with zero fan-out.

Directed cyclic graph   Number of tuples, fan-out,
                        fan-in, path length, number
                        of cycles in the graph, their
                        average length

Table 11.6. Database characterization

To  facilitate  experimentation,  we  developed  a  D/KB  generator  that  accepts  parameter 
values  as  input  and  creates  a  "random"  data  base  satisfying  these  values. 
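
The tuple counts quoted in table 11.6 for lists and full binary trees can be checked with a small generator. The sketch below is illustrative only and is not the testbed's D/KB generator: it is deterministic rather than "random", it uses a fixed list length instead of an average length, and the function names and node naming scheme are assumptions made for this example.

```python
def make_lists(n, length):
    """Return (from, to) tuples for n disjoint lists, each with `length` nodes."""
    tuples = []
    for i in range(n):
        nodes = [f"l{i}_{k}" for k in range(length)]
        # Consecutive nodes of a list are joined by one tuple (edge).
        tuples.extend(zip(nodes, nodes[1:]))
    return tuples


def make_full_binary_trees(n, depth):
    """Return (parent, child) tuples for n full binary trees with `depth` levels."""
    tuples = []
    last = 2 ** depth - 1  # number of nodes in one tree (heap numbering 1..last)
    for i in range(n):
        for k in range(1, last + 1):
            for child in (2 * k, 2 * k + 1):
                if child <= last:
                    tuples.append((f"t{i}_{k}", f"t{i}_{child}"))
    return tuples


list_tuples = make_lists(3, 5)               # 3 lists of 5 nodes each
tree_tuples = make_full_binary_trees(2, 4)   # 2 trees of depth 4
print(len(list_tuples), len(tree_tuples))
```

With depth counted in levels of nodes, each tree has 2^d - 1 nodes and therefore 2^d - 2 edges, which is where the n(2^d - 2) figure comes from.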

11.4.  Rule  Base  Characterization 

We characterize the rule base by characterizing its PCG, which is a directed cyclic
graph. Table 11.7 shows the parameters we used to characterize the PCG. The D/KB
generator accepts these parameters as input and creates a "random" directed cyclic
graph satisfying these parameters. It then generates a rule base such that the PCG of
this rule base is the above cyclic graph. In general, there are several rule bases
satisfying this condition; the D/KB generator generates one such rule base.

Table 11.7. Rule base characterization

Parameter group            Symbol       Description
-------------------------  -----------  ------------------------------------------
D/KBMS architecture                     Optimization strategy
related parameters                      LFP evaluation strategy
                                        Rule storage structures

Workload related           $R_s$        Total number of rules in the Stored D/KB
parameters                 $R_w$        Total number of rules in the Workspace
                                        D/KB
                           $P_s$        Total number of derived predicates in the
                                        Stored D/KB

Query and update related   $R_{sr}$     Number of Stored D/KB rules relevant to
parameters                              the query
                           $P_{sr}$     Number of derived predicates relevant to
                                        the query
                           $D_{sr1}$    Total number of tuples in all base
                                        relations relevant to the query
                           $D_{sr2}$    Number of relevant base relation tuples
                           $T_w$        Number of edges in the transitive closure
                                        of the PCG of the Workspace D/KB rules
                                        after extracting the relevant rules from
                                        the Stored D/KB

Parameters affecting D/KBMS performance

11.5.  Tests  and  Results 

We  now  describe  the  results  of  the  performance  measurement  and  evaluation  tests 
we  performed  using  the  testbed.  These  tests  can  be  categorized  into  two  groups:  (1) 
tests  relating  to  D/KB  query  processing  and  (2)  tests  relating  to  D/KB  updates. 

11.5.1.  Tests  Relating  to  D/KB  Query  Processing 

The  tests  relating  to  D/KB  query  processing  can  be  further  categorized  into  two 
groups:  (a)  tests  relating  to  D/KB  query  compilation  and  (b)  tests  relating  to  D/KB 
query  execution. 

11.5.1.1.  Tests  Relating  to  D/KB  Query  Compilation 

In this section, we describe the tests we conducted relating to D/KB query
compilation. After making several measurements of the various components contributing
to the compilation time, we found that the parameters with the greatest effect on D/KB
query compilation time were $R_s$, $P_s$, $R_{sr}$, and $P_{sr}$. $R_s$ and $R_{sr}$
affect the time to extract the relevant rules from the Stored D/KB, while $P_s$ and
$P_{sr}$ affect the time to read the D/KB data dictionaries. The purpose of the tests
below is to study the effect of these parameters on D/KB query compilation time.

(1) Study the effect of the total number of rules in the Stored D/KB, $R_s$, and the
number of Stored D/KB rules relevant to a query, $R_{sr}$, on the time to extract the
relevant rules from the Stored D/KB, $t_{c4}$. We varied $R_s$ from 29 to 205 in steps
of 16, recording the value of $t_{c4}$ for a query with $R_{sr} = 2$. We then repeated
this procedure for queries with $R_{sr} = 7$ and $R_{sr} = 20$.

Figure 11.1 shows the results of this experiment. For each query, notice that $t_{c4}$
is relatively insensitive to $R_s$. To understand why this is so, let us look at how
the relevant rules are extracted from the Stored D/KB for a typical D/KB query, say,

query(X, Y) :- p(X, Z), m(Z, Y).

This  is  accomplished  via  the  following  SQL  query: 

SELECT irulesource.rule
FROM irulesource, ireachablepreds
WHERE (ireachablepreds.topredname = irulesource.headpredname OR
       ireachablepreds.frompredname = irulesource.headpredname) AND
      (ireachablepreds.frompredname = "p" OR
       ireachablepreds.frompredname = "m")

The insensitivity of $t_{c4}$ to $R_s$ (the number of tuples in irulesource) is because
irulesource is typically a small enough relation to hold in memory and because
ireachablepreds has an index on its columns.

Notice in figure 11.1 that for a given value of $R_s$, $t_{c4}$ increases with
$R_{sr}$, the number of rules in the Stored D/KB relevant to the query. This is because
$R_{sr}$ is related to the join selectivity of the above SQL query. Figure 11.2 shows a
plot of $t_{c4}$ versus $R_{sr}$ for three different values of $R_s$.
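
The rule extraction query can be tried on a toy Stored D/KB. The following is a minimal sketch using SQLite, not the DBMS of the testbed. The relation and column names come from the SQL query above; the column types, the index name, the sample rules, and the helper functions are assumptions made for this example. The sketch uses `IN` with bound parameters in place of the chain of `OR` terms, and adds `DISTINCT` because a rule can match several ireachablepreds tuples.

```python
import sqlite3


def build_stored_dkb(rules, closure_edges):
    """Create an in-memory Stored D/KB with the two rule storage relations."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE irulesource (headpredname TEXT, rule TEXT)")
    conn.execute("CREATE TABLE ireachablepreds (frompredname TEXT, topredname TEXT)")
    # Index on the columns of ireachablepreds, as described in the text.
    conn.execute(
        "CREATE INDEX ireach_idx ON ireachablepreds (frompredname, topredname)"
    )
    conn.executemany("INSERT INTO irulesource VALUES (?, ?)", rules)
    conn.executemany("INSERT INTO ireachablepreds VALUES (?, ?)", closure_edges)
    return conn


def extract_relevant_rules(conn, body_preds):
    """Return the stored rules relevant to a query whose body uses body_preds."""
    marks = ", ".join("?" for _ in body_preds)
    sql = f"""
        SELECT DISTINCT irulesource.rule
        FROM irulesource, ireachablepreds
        WHERE (ireachablepreds.topredname = irulesource.headpredname OR
               ireachablepreds.frompredname = irulesource.headpredname)
          AND ireachablepreds.frompredname IN ({marks})
        ORDER BY irulesource.rule
    """
    return [row[0] for row in conn.execute(sql, list(body_preds))]


# Hypothetical rule base: p depends on q, q on b1, m on b2, z on b3.
RULES = [
    ("p", "p(X, Y) :- q(X, Y)."),
    ("q", "q(X, Y) :- b1(X, Y)."),
    ("m", "m(X, Y) :- b2(X, Y)."),
    ("z", "z(X, Y) :- b3(X, Y)."),
]
# Transitive closure of the PCG, stored as (frompredname, topredname).
CLOSURE = [("p", "q"), ("p", "b1"), ("q", "b1"), ("m", "b2"), ("z", "b3")]

conn = build_stored_dkb(RULES, CLOSURE)
relevant = extract_relevant_rules(conn, ["p", "m"])
print(relevant)
```

In this toy rule base the rule for z is never touched by the join, which is the behavior that makes the extraction time depend on $R_{sr}$ rather than on $R_s$.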

The data in figure 11.1 is for the case where the rules are stored in compiled form
in the Stored D/KB as the transitive closure of the PCG. If they are stored in raw
source form only, or if the compiled form is just the PCG (as opposed to its
transitive closure), the transitive closure of the PCG would have to be computed during
query compilation. We did not make quantitative measurements of $t_{c4}$ versus $R_s$
for these cases, since we know from our previous work on algorithms for computing
the transitive closure of a database relation that the performance deteriorates
rapidly for very large relation sizes (see chapter 7).

(2) Study the effect of the total number of derived predicates in the Stored D/KB,
$P_s$, and the number of derived predicates relevant to the query, $P_{sr}$, on the
time to read the D/KB data dictionaries. The purpose of reading these dictionaries is
to determine the types of the columns of the base and derived predicates prior to
executing the type inferencing algorithm. The motivation for doing this experiment is
that reading the D/KB data dictionaries involves accessing the Stored D/KB, which
impacts performance. The procedure here was basically the same as that in the
preceding experiment. The values of $P_{sr}$ were 1, 4, and 10 for the three queries.

Figure 11.3 shows a plot of $t_{c6}$ versus $P_s$. Notice that for a given value of
$P_{sr}$, $t_{c6}$ is relatively insensitive to $P_s$. To see why this is so, let us
look at how the intensional data dictionary is read for a query having, say, p1 and p2
as the relevant derived predicates. This is accomplished via the following SQL query:

SELECT *
FROM isystables, isyscolumns
WHERE isystables.tabid = isyscolumns.tabid AND
      (isystables.tabname = "p1" OR
       isystables.tabname = "p2")

The execution time of the above query is insensitive to $P_s$ (the number of tuples in
isystables), because we place indexes on isystables and isyscolumns.

Also notice that for a given value of $P_s$, $t_{c6}$ increases with $P_{sr}$. This is
because $P_{sr}$ is related to the join selectivity of the above query. Figure 11.4
shows a plot of $t_{c6}$ versus $P_{sr}$ for three different values of $P_s$.

We have not shown a plot of $t_{c6}$ versus $R_s$, but we found $t_{c6}$ to be
insensitive to $R_s$. This is because $R_s$ affects $t_{c6}$ only insofar as it affects
$P_s$, the number of derived predicates in the Stored D/KB, and as we explained above,
$t_{c6}$ is insensitive to $P_s$.

(3) Study the relative contributions of the different steps in D/KB query compilation
to the total compilation time $t_c$. After making several measurements of $t_{c1}$
through $t_{c11}$, we found that the principal contributions to $t_c$ came from
$t_{c2}$, $t_{c4}$, $t_{c6}$, $t_{c8}$, and $t_{c11}$. Figure 11.5 shows pie charts of
the contributions of these components for three different queries, with $R_{sr}$ equal
to 1, 7, and 20. Notice that as $R_{sr}$ increases from 1 to 20, the relative
contribution of $t_{c4}$, the time to extract the relevant Stored D/KB rules, increases
from 25% to 67%. Also, the rate of increase appears to be quite rapid.

$t_{c8}$ appears to be making a non-trivial contribution to $t_c$. This is actually due
to the fact that in our testbed, the evaluation order list is computed by making a
Unix system call to execute the Unix topological sort utility. The overhead
imposed by the system call is particularly significant, since the evaluation order
graph is typically quite small. We could have avoided making the system call by
implementing a topological sort algorithm; this would have had the effect of making
the contribution of $t_{c8}$ insignificant.
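
As an illustration of how small an in-process topological sort can be, the sketch below orders a toy evaluation order graph with Python's standard `graphlib` module (Python 3.9 or later). The graph contents and the `evaluation_order` helper are assumptions made for this example; in particular, it assumes that recursive cliques have already been collapsed into single nodes, so that the graph is acyclic.

```python
from graphlib import TopologicalSorter


def evaluation_order(depends_on):
    """Return the nodes so that every node appears after the nodes it depends on.

    depends_on maps each node to the set of nodes that must be evaluated first.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Toy evaluation order graph for the ancestor query: the query needs the
# ancestor clique, which in turn needs the parent base relation.
graph = {
    "query": {"ancestor"},
    "ancestor": {"parent"},
    "parent": set(),
}
order = evaluation_order(graph)
print(order)
```

For a graph of this size the sort is a few dictionary operations, so its cost would be negligible next to the process creation involved in calling an external utility.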

Finally, note that the relative contribution of $t_{c11}$, the time to compile the code
fragment generated by the Knowledge Manager and link it with the run-time
library, appears quite significant. However, this is very much compiler dependent
and can vary greatly from system to system. We can make a similar observation
about $t_{c2}$.

11.5.1.2.  Tests  Relating  to  D/KB  Query  Execution 

Quantitative analysis of D/KB query execution performance is complicated by the
fact that the execution time is greatly influenced by the nature of the query and data.
This is because the type of query and data greatly affect the size of the set of
relevant facts, $D_{sr2}$, and the amount of duplicate work done during LFP
computation, which were shown in [Banc86] to be two of the most important parameters
influencing D/KB query execution time. The principal purpose of the tests described in
this section is to study the effect of these parameters on D/KB query execution time.

The  tests  all  use  the  ancestor  query: 

ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

query(X) :- ancestor("john", X).

and  tree  structured  data  for  the  parent  relation.  The  results  will  obviously  be  different 
for  other  queries  and  data  types.  Still,  we  can  draw  some  general  conclusions  from  the 
test  results  for  the  ancestor  query  and  tree  structured  data. 

When  studying  the  effect  of  redundant  work,  we  didn’t  directly  measure  this 
parameter.  Rather,  we  measured  the  execution  times  for  naive  and  semi-naive  LFP 
evaluation,  the  difference  between  the  two  indicating  the  impact  of  redundant  work. 

We  now  describe  the  tests  in  more  detail. 

(4) Study the effect of the fraction of relevant facts, $D_{sr2}/D_{sr1}$, on D/KB
query execution time, $t_e$. We varied $D_{sr2}/D_{sr1}$ in two different ways. In the
first method, we kept $D_{sr1}$ fixed by keeping the size of the parent relation fixed
and varied $D_{sr2}$ by rooting the ancestor query at different sub-trees of the parent
relation. Thus, each value of $D_{sr2}$ was obtained from a different query, each query
having a different constant. In the second method, we kept $D_{sr2}$ fixed by fixing
the query constant and varied $D_{sr1}$ by making the parent relation progressively
larger. Here, the same query was applied to parent relations of different sizes.
Semi-naive evaluation was used for LFP computation. Optimization was not used.

Figure 11.6 shows a plot of $t_e$ versus $D_{sr2}/D_{sr1}$. Notice that when $D_{sr1}$
is fixed, $t_e$ is insensitive to $D_{sr2}$. This is because in the absence of the
magic set optimization, the transitive closure of the entire parent relation is
computed, regardless of the percentage of this relation that is actually relevant to
the query. We study the impact of this optimization in another experiment.

On the other hand, when $D_{sr2}$ is fixed but $D_{sr1}$ is not, $t_e$ increases with
$D_{sr1}$ (equivalently, $t_e$ decreases with $D_{sr2}/D_{sr1}$). This is because the
transitive closure is being computed for progressively larger relation sizes.

(5) Study the impact of redundant work done during LFP computation on D/KB query
execution time. We first measured $t_e$ for several ancestor queries rooted at
different sub-trees of the parent relation, keeping $D_{sr1}$ fixed and using
semi-naive LFP evaluation. We then repeated this procedure for naive LFP evaluation.

Figure 11.7 shows a plot of $t_e$ versus $D_{sr2}/D_{sr1}$ for both naive and
semi-naive evaluation. Notice that for the query and database in this test, semi-naive
evaluation is between 2.5 and 3 times faster than naive evaluation. The difference is
due to the fact that semi-naive evaluation avoids a lot of duplicate work by computing
only the differential of $f(R)$ during each iteration when evaluating the LFP of
$R = f(R)$. Naive evaluation, on the other hand, recomputes tuples computed in
previous iterations.
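
The difference between the two strategies can be made concrete with an in-memory sketch for the ancestor rules. This is an illustration under simplifying assumptions, not the testbed's implementation: relations are Python sets rather than DBMS tables, the joins are nested loops, and the `derivations` counter (a name introduced here) simply counts how many join results each strategy produces, as a rough stand-in for the work done.

```python
def naive_ancestor(parent):
    """Naive LFP: re-evaluate both rules over the whole relation each iteration."""
    ancestor, derivations = set(), 0
    while True:
        new = set(parent)                      # ancestor(X,Y) :- parent(X,Y).
        for (x, z) in parent:                  # ancestor(X,Y) :- parent(X,Z),
            for (z2, y) in ancestor:           #                  ancestor(Z,Y).
                if z == z2:
                    new.add((x, y))
                    derivations += 1
        if new == ancestor:                    # termination check
            return ancestor, derivations
        ancestor = new


def seminaive_ancestor(parent):
    """Semi-naive LFP: join parent only with tuples new in the last iteration."""
    ancestor = set(parent)
    delta, derivations = set(parent), 0
    while delta:
        derived = set()
        for (x, z) in parent:
            for (z2, y) in delta:
                if z == z2:
                    derived.add((x, y))
                    derivations += 1
        delta = derived - ancestor             # set difference: the new tuples
        ancestor |= delta
    return ancestor, derivations


chain = {("a", "b"), ("b", "c"), ("c", "d")}
naive_result, naive_work = naive_ancestor(chain)
semi_result, semi_work = seminaive_ancestor(chain)
print(len(naive_result), naive_work, semi_work)
```

Both functions return the same relation; the naive version re-derives tuples found in earlier iterations, so its counter is larger even on this three-tuple chain.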

(6) Study the relative contributions of the various steps in the while loop of naive
and semi-naive LFP evaluation. Chapter 12 showed the pseudo-code for naive and
semi-naive LFP evaluation used in our testbed. This experiment studies the relative
contributions of the various steps in the while loop of these algorithms. Table
11.8 shows these steps.

Figure 11.8 shows the results of this test. Notice that in naive evaluation, 94% of
the time is spent in evaluating the right hand side of the recursive equations and
doing the termination check, while the corresponding activities consume 82% of the
time in semi-naive evaluation. The activities are not quite the same, since semi-naive
evaluation computes only the differential of the right hand side of the recursive
equations during each iteration.

Figure  11.9  shows  a  comparison  of  the  time  taken  to  evaluate  the  right  hand  side 
of  the  recursive  equations  (or  the  differential  in  semi-naive  evaluation)  and  to  do 
the  termination  check  for  naive  and  semi-naive  evaluation.  Notice  that  the  times 
for  naive  evaluation  are  about  2.5  to  3  times  greater  than  those  for  semi-naive 
evaluation.  This  is  the  principal  reason  semi-naive  evaluation  was  found  to  be  2.5 
to  3  times  faster  than  naive  evaluation  in  the  previous  test.

Table  11.8.  Steps  in  while  loop  of  LFP  evaluation

The termination check is expensive in our implementation because the SQL interface
between the Knowledge Manager and the DBMS forces a set difference, an
expensive operation, to be computed during this check.

(7) Study the impact of using the magic set optimization on D/KB query execution
time. This test consists of three parts. First, we measured $t_e$ as a function of
$D_{sr2}/D_{sr1}$ for the four cases resulting from using naive and semi-naive
evaluation with and without optimization. $D_{sr2}/D_{sr1}$ was varied by keeping
$D_{sr1}$ fixed and varying $D_{sr2}$. Figure 11.10 shows the results of this test.
Notice that $t_e$ is insensitive to $D_{sr2}/D_{sr1}$ in the absence of optimization,
since the transitive closure of the entire parent relation is computed in this case.
However, with optimization, $t_e$ increases with $D_{sr2}/D_{sr1}$. This is because
with optimization the transitive closure is computed only for the relevant portion of
the parent relation, which grows progressively larger in this test.

The tradeoffs in using optimization can be clearly seen in figure 11.10: there is a
crossover point beyond which optimization results in higher query execution times.
This typically happens when the selectivity of the query is high, i.e., when most of
the database is relevant. To understand why this is so, recall that in the magic set
strategy, an LFP computation is first done to evaluate the magic rules and determine
the set of relevant facts. Then, another LFP computation is done to evaluate
the modified rules and determine the query results. In the latter computation, the
magic set predicates are base relations. When the selectivity of the query is low,
these relations are small enough that the two LFP computations together take less
time than a single LFP computation on the original base relations. However, when the
selectivity of the query is high, the magic set predicates are large and the extra
overhead of first computing them results in a higher overall execution time.
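
The two LFP computations can be sketched in memory for the ancestor query. The code below is an illustration, not the testbed's generated code: the magic and modified rules shown in the comments are the standard rewriting for a query whose first argument is bound, while the `lfp` and `ancestors_with_magic` helpers and the sample data are assumptions made for this example.

```python
def lfp(step, start):
    """Semi-naive style fixpoint: apply `step` to the newest tuples until none are new."""
    total, delta = set(start), set(start)
    while delta:
        delta = step(delta) - total
        total |= delta
    return total


def ancestors_with_magic(parent, constant):
    # First LFP, the magic rules:
    #   magic(constant).
    #   magic(Z) :- magic(X), parent(X, Z).
    magic = lfp(lambda delta: {z for (x, z) in parent if x in delta}, {constant})

    # Second LFP, the modified rules:
    #   ancestor(X, Y) :- magic(X), parent(X, Y).
    #   ancestor(X, Y) :- magic(X), parent(X, Z), ancestor(Z, Y).
    relevant = {(x, y) for (x, y) in parent if x in magic}
    ancestor = lfp(
        lambda delta: {
            (x, y) for (x, z) in relevant for (z2, y) in delta if z == z2
        },
        relevant,
    )
    answers = {y for (x, y) in ancestor if x == constant}
    return answers, relevant


parent = {
    ("john", "mary"), ("john", "paul"), ("mary", "ann"),
    ("bob", "carl"), ("carl", "dave"),
}
answers, relevant = ancestors_with_magic(parent, "john")
print(sorted(answers), len(relevant), len(parent))
```

Here only three of the five parent tuples take part in the second computation. When nearly all tuples are relevant, the second computation does about as much work as the unoptimized one, and the first computation is pure overhead, which is the crossover effect described above.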

The  crossover  selectivity  where  optimization  degrades  performance  for  semi-naive 
evaluation  is  about  72%,  while  it  is  about  82%  for  naive  evaluation.  The  higher 
crossover  point  for  naive  evaluation  is  due  to  the  fact  that  optimization  has  a 
bigger  impact  on  naive  evaluation  as  it  does  a  lot  of  redundant  work. 


Figure 11.11 shows the execution times for the two LFP computations as a function
of $D_{sr2}/D_{sr1}$. The rate of increase of the magic rules evaluation is lower than
that for the modified rules evaluation. This is because the magic rules evaluation time
depends mostly on $D_{sr1}$, the size of the base relations, which was fixed in this
test. On the other hand, the modified rules evaluation time is quite sensitive to
$D_{sr2}$, the number of relevant facts, which was varied.

The impact of optimization is significant for queries with low selectivity. For
example, notice from figure 11.10 that for semi-naive evaluation, when only 5% of
the base relation tuples are relevant, execution with optimization is about
6 to 7 times faster than without optimization.

The impact of optimization is particularly significant for very large base relations
and very low query selectivity. The second part of this test studied this impact.
Here, we executed an ancestor query with very low selectivity (.05%) against a
parent relation containing over 16,000 tuples and measured $t_e$ with and without
optimization. We found that without optimization, the query took several orders
of magnitude longer to execute than it did with optimization! We expect that in
very large database environments, the query selectivity will be small in many cases,
and so this test represents a very plausible scenario.

In the third part of this test, we varied $D_{sr2}/D_{sr1}$ by keeping $D_{sr2}$ fixed
and varying $D_{sr1}$. Figure 11.12 compares the query execution times with and
without optimization. As expected, with optimization the curve is flat, since the
number of relevant facts is fixed. On the other hand, without optimization, the
execution time decreases with $D_{sr2}/D_{sr1}$ (that is, it increases as $D_{sr1}$
grows), since the transitive closure is being computed for progressively larger
relation sizes.

11.5.2. Tests Relating to D/KB Updates

In  this  section,  we  describe  the  tests  we  conducted  relating  to  D/KB  updates.  We 
reiterate  that  in  our  testbed,  D/KB  updates  just  update  the  rule  storage  structures  in 
the  Stored  D/KB  with  the  Workspace  D/KB  rules.  In  particular,  there  is  no  checking 
of  these  rules  against  integrity  constraints  that  may  be  associated  with  the  Stored 
D/KB. 

The parameters affecting D/KB update time are $R_s$, $R_{sr}$, $R_w$, and $T_w$. $R_s$
and $R_{sr}$ affect the time taken to extract the relevant rules from the Stored D/KB,
while $R_w$ and $T_w$ affect the time taken to update the ireachablepreds storage
relation. The experiments below study the effect of these parameters.
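
To make $T_w$ concrete, the sketch below builds the PCG of a small rule set and computes its transitive closure, whose edge count is what an update would have to write into ireachablepreds. It is an illustration only: the representation of a rule as a head predicate with a list of body predicates, the direction of the PCG edges (from head to body predicate), and the function names are assumptions made for this example, and the closure is computed by simple iteration rather than by any of the algorithms of chapter 7.

```python
def pcg_edges(rules):
    """rules: iterable of (head_predicate, [body_predicates]) pairs."""
    return {(head, body) for head, bodies in rules for body in bodies}


def transitive_closure(edges):
    """Repeatedly join the edge set with itself until no new edge appears."""
    closure = set(edges)
    while True:
        extra = {
            (a, d) for (a, b) in closure for (c, d) in closure if b == c
        } - closure
        if not extra:
            return closure
        closure |= extra


# The ancestor rules and query used in the query execution tests.
rules = [
    ("ancestor", ["parent"]),
    ("ancestor", ["parent", "ancestor"]),
    ("query", ["ancestor"]),
]
closure = transitive_closure(pcg_edges(rules))
t_w = len(closure)
print(sorted(closure), t_w)
```

The closure grows faster than the rule set itself, which is consistent with the cost of maintaining the compiled storage structure rising with $R_w$ and $T_w$.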

(8) Study the effect of $R_s$ and $R_{sr}$ on the D/KB update time, $t_u$. We loaded
the Workspace D/KB with a single rule and updated the Stored D/KB, varying the
value of $R_s$ from 9 to 189. There were 8 rules in the Stored D/KB relevant to the
Workspace D/KB rule. Figure 11.13 shows a plot of $t_u$ versus $R_s$ both with and
without compiled rule storage structures. In the latter case, the update time is
simply $t_{u2}$, the time to store the source form of the rules. Notice that updates
are almost an order of magnitude faster without compiled form rule storage.

Also, $t_u$ is relatively insensitive to $R_s$. The main reason for this is that the
time to extract the relevant rules from the Stored D/KB is a significant contributor to
$t_u$ (see next experiment) and this time depends only on $R_{sr}$ (see D/KB query
compilation experiment 1). We have not explicitly studied the impact of $R_{sr}$ on
$t_u$, as it is the same as the impact of $R_{sr}$ on $t_{c4}$, which we studied
before. The insensitivity of $t_u$ to $R_s$ is significant because it means that the
D/KB update time does not degrade for very large rule sets.

(9) Study the relative contributions of the different components of $t_u$ as a function
of $R_w$ and $T_w$. We updated a Stored D/KB with $R_s = 189$ with a Workspace D/KB
containing 38 rules and measured the values of $t_{u1}$, $t_{u2}$, and $t_{u3}$. The
value of $T_w$ was 137, i.e., the transitive closure of the PCG of the Workspace D/KB
rules after extracting the relevant rules contained 135 edges. We then repeated this
procedure for a Workspace D/KB with 1 rule and $T_w = 21$. Figure 11.14 shows a pie
chart of the relative contributions. The main point to note is that $t_{u1}$, the time
to extract the relevant rules, is a significant component of $t_u$. For small values of
$R_w$ and $T_w$, this time in fact makes the bulk of the contribution to $t_u$.
However, for large values of $R_w$ and $T_w$, the percentage contribution of $t_{u1}$
decreases, since that of $t_{u3}$ increases. Also, notice that the time to store the
source form of the rules, $t_{u2}$, contributes only a small amount to the overall D/KB
update time.

Figure 11.7. Comparison of naive and semi-naive LFP evaluation

Figure 11.9. LFP termination check comparison


11.6.  Conclusions 

This chapter presented and analyzed the results of several experiments we performed
to quantitatively measure D/KBMS performance and to study performance as a
function of various parameters. We list below the conclusions we can draw from the
D/KBMS performance measurement and evaluation study.

1. In order that the D/KBMS be scalable to handle large rule sets, it is important
that the rule storage structures be such that the time to extract the relevant rules
is independent of the total number of rules in the Stored D/KB. Otherwise, the
D/KB query compilation times will grow with the size of the rule base. Our
experimentation has shown that storing the transitive closure of the PCG of the rules
and placing an index on the columns of this storage structure achieves the effect of
making the relevant rules extraction time independent of the rule base size. This is
because with this storage structure, the time to extract the relevant rules depends
only on the number of rules extracted and not on the total number of rules.

2. There are two important tradeoffs that relate to rule storage structures. The first
is a time-vs-space tradeoff. Compiled form rule storage structures like the transitive
closure of the PCG use more space but permit faster query compilation than
non-compiled storage structures. The other tradeoff is between query compilation
time and update time. Compiled form storage structures take longer to update,
sometimes even an order of magnitude longer as some of our experiments indicated,
than non-compiled storage structures. The choice of rule storage structure must be
dictated by the relative cost of storage versus compilation and by the application
characteristics, that is, whether it is query intensive or update intensive.

3. The PCG itself (as opposed to its transitive closure) has been proposed as a rule
storage structure. However, this is not a good choice for query intensive
applications. This is because during query compilation, the transitive closure of the
PCG will have to be computed to extract the relevant rules and this can get very time
consuming for rules with large PCGs.

4. As we argued before, from the D/KB query compilation point of view, a good rule
storage structure is one where the relevant rule extraction time depends only on the
number of rules extracted and not on the total number of rules. However, we
found that the time to extract the relevant rules is very sensitive to the number of
rules extracted. A key to avoiding excessive compilation times is to structure the
rules in such a way that the number of relevant rules for a query is small.
Object-oriented database techniques can prove to be very useful here. For example, a
small set of rules can be encapsulated within an object and these rules can be
retrieved whenever the object receives a message representing a query against them.
Encapsulating rules within an object is a way of structuring the rules so that only the
relevant portions of the rule base are processed during compilation. Of course,
much work needs to be done to integrate the concepts of object-oriented database
systems (inheritance, message processing, persistent objects, etc.) with those of
D/KB query and update processing.

5. Precompilation of D/KB queries can prove to be very useful. This is especially true
for frequently occurring queries with large $R_{sr}$ values. The price of
precompilation is that, for precompiled queries, information about referenced relations
and rules must be recorded. During updates, this information is checked to see whether
the update invalidates any compiled query. However, for applications involving few
updates and frequently occurring queries with large $R_{sr}$ values, this price is well
worth paying.

6. Two of the main parameters affecting D/KB query execution time are the ratio of
relevant facts to total number of facts ($D_{sr2}/D_{sr1}$) and the amount of redundant
work done in the while loop of LFP evaluation. To reduce the amount of redundant
work and to restrict the LFP evaluation to the relevant database tuples, the
D/KBMS architecture can use semi-naive LFP evaluation and the generalized
magic sets optimization strategy.

7. There is a tradeoff in using optimization: while optimization restricts LFP
evaluation to the relevant tuples of the database, work must be done to first determine
these tuples. There is a crossover value of $D_{sr2}/D_{sr1}$ beyond which optimization
actually results in higher query execution times. Optimization pays best when the
selectivity of the query is low, i.e., for queries that retrieve only a small fraction
of the database. The benefit of optimization is particularly telling for queries with
very low selectivity and very large base relations. For such applications, we found
that without optimization, the query took several orders of magnitude longer to
execute than it did with optimization! We expect that in very large database
environments, the query selectivity will be small in many cases and the use of
optimization is highly recommended despite the extra work introduced. Ideally,
the D/KBMS query optimizer should adapt the optimization strategy dynamically,
switching it on for queries with low selectivity and off otherwise.

8.  Relational  algebra  alone  is  not  a  good  choice  for  the  DBMS  interface,  since  the  LFP 
evaluation  in  this  case  has  to  be  done  via  an  application  program  and  this  intro¬ 
duces  several  inefficiencies.  For  example,  during  each  iteration  of  the  while  loop, 
several  table  copies  are  performed.  Also,  the  termination  check  becomes  a  very 
expensive  operation  since  with  relational  algebra  as  the  DBMS  interface  this 
involves  computing  a  set  difference.  In  fact,  with  relational  algebra,  the  "real" 
work  in  LFP  evaluation,  viz.,  evaluating  the  right  hand  side  of  the  recursive  equa¬ 
tions  (or  their  differential),  takes  up  only  about  30%  of  the  while  loop  execution 
time.  The  rest  of  the  time  is  spent  doing  table  copies,  termination  checking,  and 
clearing  temporary  tables. 

9.  The  above  inefficiencies  cannot  be  overcome  using  parallelism  alone.  While  a 
parallel  relational  database  machine  can  certainly  speed  up  table  copying  and  ter¬ 
mination  checking,  it  does  not  significantly  reduce  the  percentage  contribution  of 
these  operations  to  the  while  loop  execution  time. 

10.  To achieve high performance in D/KB query execution, it is very important that
the relational algebra interface be augmented with a generalized LFP operator.
This operator should accept a set of recursive equations of the form
$r_i = f_i(r_1, \ldots, r_n)$, $i = 1, \ldots, n$, as input and compute their least
fixed point, thereby solving for each $r_i$. By including such an operator in the
DBMS interface, many of the inefficiencies that arise with relational algebra can be
alleviated. We mention several optimization possibilities that open up if the DBMS
interface includes an LFP operator, none of which is possible if the interface is
just relational algebra:

a.  Table  copying  can  be  avoided  by  manipulating  buffer  pointers. 

b.  The  full  set  difference  operation  during  termination  checking  can  be  avoided. 
This  is  because  as  soon  as  a  tuple  is  found  that  was  not  computed  in  the  pre¬ 
vious  iteration,  the  termination  check  can  stop. 

c.  A  dynamically  adaptable  indexing  strategy  can  be  designed  to  speed  up  the 
evaluation  of  the  right  hand  side  of  the  recursive  equations  or  their 
differential.  This  strategy  would  dynamically  create  and  drop  temporary 
indexes  on  the  base  and  intermediate  derived  relations  depending  on  their 
relative  sizes. 

d.  The  join  strategy  can  be  dynamically  changed  between  iterations  if  necessary, 
depending  on  the  sizes  of  the  base  and  intermediate  derived  relations  and  the
join  selectivities  from  the  previous  iterations. 

11.  The  performance  of  LFP  evaluation  can  be  significantly  improved  by  parallel  and 
pipelined  processing.  We  list  several  strategies  below: 

a.  During  each  iteration,  the  right  hand  side  of  each  recursive  equation  may  be 
evaluated  in  parallel. 

b.  Pipelining  and  data  flow  techniques  may  be  used  to  evaluate  the  relational 
algebra  tree  corresponding  to  the  right  hand  side  of  these  equations. 

c.  Parallel  join  algorithms  may  be  employed  during  this  evaluation. 

12.  In  addition  to  a  general  LFP  operator,  the  DBMS  interface  should  include  com¬ 
monly  occurring  special  LFP  operators,  such  as  transitive  closure.  This  is  because 
it  may  be  possible  to  optimize  the  execution  of  such  special  operators  better  than 
that  of  a  general  LFP  operator.  In  general,  it  will  be  difficult  for  the  query  optim¬ 
izer  to  recognize  that  a  given  set  of  LFP  equations  corresponds  to  one  or  another 
specialized  LFP  operator.  Therefore,  the  Knowledge  Manager  interface  should 
include  ways  of  denoting  such  operators.  Then  the  Knowledge  Manager  can  gen¬ 
erate  code  containing  them  and  the  DBMS  can  execute  this  code  efficiently. 
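
The first two optimizations listed under conclusion 10 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of this report: relations are modelled as in-memory sets of tuples, the equations are monotone (so each iteration can only add tuples), and the names `lfp`, `tc_rhs`, `edge`, and `closure` are hypothetical. Rebinding the result dictionary stands in for manipulating buffer pointers, and the termination check stops at the first new tuple rather than building a full set difference.

```python
from typing import Callable, Dict, Set, Tuple

Relation = Set[Tuple]                      # a relation modelled as a set of tuples
Equation = Callable[[Dict[str, Relation]], Relation]


def lfp(equations: Dict[str, Equation], max_iterations: int = 10_000) -> Dict[str, Relation]:
    """Naive least-fixed-point iteration over monotone equations r_i = f_i(r_1..r_n)."""
    current: Dict[str, Relation] = {name: set() for name in equations}
    for _ in range(max_iterations):
        # Evaluate every right hand side against the previous iteration's relations.
        new = {name: rhs(current) for name, rhs in equations.items()}
        # Early-exit termination check: any() stops at the first tuple that was
        # not present in the previous iteration; no set difference is built.
        changed = any(
            t not in current[name] for name in equations for t in new[name]
        )
        # "Pointer swap": rebind the name instead of copying the tables.
        current = new
        if not changed:
            return current
    raise RuntimeError("no fixed point reached within max_iterations")


# Hypothetical example: transitive closure, tc = edge UNION (edge JOIN tc).
edge = {(1, 2), (2, 3), (3, 4)}


def tc_rhs(rel: Dict[str, Relation]) -> Relation:
    joined = {(a, d) for (a, b) in edge for (c, d) in rel["tc"] if b == c}
    return edge | joined


closure = lfp({"tc": tc_rhs})["tc"]
```

An application program restricted to plain relational algebra would have to issue the copy and the set difference as separate operations in each iteration; inside a single operator both can be folded into the loop as above.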

The  conclusions  above  justify  our  D/KBMS  architecture  specification  methodology. 
Briefly,  using  this  methodology,  we  would  design  a  high  performance  D/KBMS  by  first 
designing  a  parallel  relational  database  machine  that  employs  the  parallel  and  pipelined 
join  algorithms  we  developed  under  the  VLPDF  contract.  Next,  we  would  design  paral¬ 
lel  algorithms  for  LFP  evaluation  using  the  above  join  strategies,  data  flow  and  pipelin¬ 
ing  techniques,  and  semi-naive  evaluation.  Finally,  we  would  design  a  Knowledge 
Manager  that  compiles  Horn  clause  queries  to  relational  algebra  augmented  with  a  gen¬ 
eral  LFP  operator  and  that  uses  the  generalized  magic  sets  strategy  for  restricting  the 
search  space  to  the  relevant  base  relation  tuples.  The  conclusions  above  suggest  that 
CHAPTER  12


Conclusions  and  Future  Directions 


This  chapter  summarizes  the  conclusions  from  the  VLPDF  program  and  indicates 
several  directions  for  future  work.  It  is  organized  as  follows.  Sections  12.1  through  12.3 
present  the  principal  conclusions  from  the  three  phases  of  the  VLPDF  program.  Section 
12.4  presents  our  thoughts  on  future  directions. 

12.1.  Phase  I  Conclusions 

This  section  presents  the  principal  conclusions  from  the  six  investigation  studies 
conducted  during  Phase  I. 

Parallel  architectures  for  D/KBMS  application  interface 

•  Shared memory greatly facilitates implementing stream-AND parallelism, and the
key to high performance stream-AND parallelism is an efficient shared memory
abstraction on a loosely coupled architecture. The (AMP)2 abstract machine
described in chapter 3 illustrates how such an abstraction can be achieved. It does
so using a number of optimizations that address critical problems in the design of
efficient parallel architectures, viz., communication and memory latencies, and
synchronization overheads.

•  The conclusion from our work investigating the feasibility of executing PARLOG
programs on the Connection Machine (CM) architecture is that a coarse grained,
loosely coupled architecture is better suited than the Connection Machine for
executing PARLOG programs, since the form of parallelism best supported by the CM
is directly opposite to that found in PARLOG.

D/KB  query  processing  concepts 

•  Recursive  query  processing  is  a  key  concept  differentiating  D/KB  query  processing 
from  traditional  database  query  processing. 

•  The two basic strategies for Horn clause query evaluation are top-down evaluation
and bottom-up evaluation. Top-down strategies are more efficient but more complex
and harder to implement. Bottom-up strategies are simpler and easier to implement
but do a lot of useless work. Bottom-up evaluation of nonrecursive predicates
can be accomplished via a straightforward compilation to relational algebra,
while that of recursive predicates involves evaluating the LFP of a set of recursive
equations.

•  The two basic strategies for bottom-up LFP computation are naive evaluation and
semi-naive evaluation. Naive evaluation is less efficient, as it recomputes tuples
computed during previous iterations. Semi-naive evaluation avoids much of this
redundant work by computing the differential of the right hand side of the recursive
equations.

•  Sideways  information  passing  to  restrict  the  search  space  to  the  relevant  base  rela¬ 
tion  tuples  and  rewriting  the  rules  in  the  D/KB  to  an  equivalent  form  whose  LFP 
computation  is  more  efficient  are  the  basic  ideas  behind  D/KB  query  optimization 
strategies. 
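
The difference between naive and semi-naive evaluation can be seen in a small sketch. It assumes, for illustration only, a single linear recursive rule (transitive closure over an `edge` relation held as an in-memory set of pairs). The function names and the `work` counter, which counts the tuples fed into the join in each iteration, are hypothetical and are not the cost measures used in this report.

```python
def compose(left, edge):
    """Return {(a, c) | (a, b) in left and (b, c) in edge}."""
    successors = {}
    for b, c in edge:
        successors.setdefault(b, []).append(c)
    return {(a, c) for a, b in left for c in successors.get(b, ())}


def naive_tc(edge):
    """Recompute the whole right hand side from scratch in every iteration."""
    total, work = set(), 0
    while True:
        work += len(total)                  # every known tuple is joined again
        derived = set(edge) | compose(total, edge)
        if derived == total:
            return total, work
        total = derived


def semi_naive_tc(edge):
    """Join only the tuples that were new in the previous iteration (the differential)."""
    total, delta, work = set(edge), set(edge), 0
    while delta:
        work += len(delta)                  # only the new tuples are joined
        delta = compose(delta, edge) - total
        total |= delta
    return total, work


chain = {(i, i + 1) for i in range(1, 8)}   # illustrative data: a path 1 -> 2 -> ... -> 8
naive_result, naive_work = naive_tc(chain)
semi_result, semi_work = semi_naive_tc(chain)
```

Both functions return the same relation; on the chain above the semi-naive version feeds fewer tuples into the join, because tuples derived in earlier iterations are never joined again.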

Transitive  closure  algorithms 

•  Transitive  closure  represents  an  important  class  of  D/KB  queries. 

•  Warren’s algorithm works better than the logarithmic iterative algorithm and an
improved version of this algorithm in two cases: (1) the size of the relation is
not much larger than the size of available memory, and (2) the path lengths in the
transitive closure graph vary greatly. In the second case, the iterative algorithms
have to join two whole relations (often very large) iteratively to find a small
number of tuples, and the total cost increases dramatically. Thus, our recommendation
is to implement Warren’s algorithm for transitive closure in database systems
and let the query optimizer select it adaptively.

•  The  HYBRIDTC  algorithm  that  we  developed  and  reported  in  chapter  5  can  pro¬ 
vide  significant  performance  benefits,  particularly  in  large  D/KB  environments, 
since  it  is  very  amenable  to  parallel  processing. 
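
For readers unfamiliar with it, the following is a minimal in-memory sketch of Warren's algorithm as it is usually presented: a two-pass, row-oriented variant of Warshall's algorithm over a boolean adjacency matrix. The version evaluated in chapter 5 operates on disk-resident relations and may differ in detail; the bitset representation and the helper names `to_rows` and `to_edges` are assumptions made for this example.

```python
def warren_closure(rows):
    """Transitive closure of a boolean adjacency matrix.

    rows[i] is an int used as a bitset: bit j is set iff there is an edge i -> j.
    """
    rows = list(rows)
    n = len(rows)
    # Pass 1: scan the entries below the diagonal, row by row.
    for i in range(1, n):
        for j in range(i):
            if (rows[i] >> j) & 1:
                rows[i] |= rows[j]          # add every successor of j to row i
    # Pass 2: scan the entries above the diagonal, row by row.
    for i in range(n - 1):
        for j in range(i + 1, n):
            if (rows[i] >> j) & 1:
                rows[i] |= rows[j]
    return rows


def to_rows(n, edges):
    """Build the bitset rows for nodes 0..n-1 from a set of (source, target) pairs."""
    rows = [0] * n
    for a, b in edges:
        rows[a] |= 1 << b
    return rows


def to_edges(rows):
    """Convert bitset rows back into a set of (source, target) pairs."""
    n = len(rows)
    return {(i, j) for i, row in enumerate(rows) for j in range(n) if (row >> j) & 1}


closure_edges = to_edges(warren_closure(to_rows(4, {(0, 2), (2, 1), (1, 3)})))
```

Each pass reads the matrix one row at a time, which is why the algorithm behaves well when the relation is not much larger than available memory.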

Join  algorithms 

•  It is important to choose appropriate algorithms for a particular join operation
with a given system configuration. Furthermore, with a given system and relations
to be joined, the query optimizer has to carefully determine the number of clusters,
the number of disks, and the number of processors that will be used in the join.
Generally speaking, the hash-based algorithms outperform the sort-merge algorithms
if the output tuples are not required in sorted order. However, if the source
relations are already sorted, or the application requires that the output tuples
be sorted on the join attributes, the sort-merge algorithms may be advantageous.

•  In multiprocessor-multidisk systems, high parallelism can be achieved by dividing
the total processing task among processors and disks and executing the subtasks
concurrently. However, in some algorithms, such as the sort-merge algorithms
evaluated in this study, parallel processing becomes difficult for some steps
(the final merge, for example), and increasing the number of processors cannot speed
up these steps. On the other hand, the hash-based algorithms are naturally
parallelizable. Both the partitioning and joining phases can be executed concurrently
by all participating processors. This is the main reason why the hash-based
algorithms outperform the sort-merge algorithms with regard to elapsed time.
•  Among the three major system resources (CPU, disk, and communication network),
the CPU does not seem to be the bottleneck of the processing pipeline in general
(only in some steps of the sort-merge joins, as mentioned above). For hash-based
algorithms, a small number of processors at each cluster is enough to provide the
necessary processing power. On the other hand, disk I/O can be the bottleneck of the
pipeline, although we intentionally used a large page size (32K) and a very high
disk-memory transfer rate in our study. One possible approach is to increase the
number of disks in each cluster. Such a multi-disk system can efficiently remove the
bottleneck caused by slow disk I/O. However, the number of disks that can be
attached to one cluster is limited by the complexity of control.

•  One key point in the design of a parallel processing algorithm is to achieve
maximum overlap among operations requiring different resources, in order to increase
the parallelism and reduce the effect of whichever resource is the bottleneck of the
pipeline. For example, in the hash-based algorithms, the remotely processed tuples
can be transferred either during partitioning or right before their use in the joining
phase. The total communication cost is the same in these two schemes, while their
overlap with disk I/O is different. In the first scheme, all communication
occurs while the relations are partitioned. The second scheme distributes the
communication cost: each relatively small data transfer overlaps with disk
I/O and CPU processing in the joining phase. Which scheme is better will depend on
the relative speed of disk I/O and data transfer over the network. This example
reminds us that parallelism between different types of resources can be further
increased by tuning the processing steps carefully for each algorithm.
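
The reason hash-based joins parallelize naturally can be seen from the structure of a partitioned hash join, sketched below under simplifying assumptions: both relations fit in memory as lists of tuples, the bucket pairs are processed one after another rather than on separate processors, and the function names are hypothetical. It is not one of the algorithms evaluated in this study.

```python
from collections import defaultdict


def partition(relation, key_index, n_buckets):
    """Partitioning phase: route each tuple to a bucket by hashing its join attribute."""
    buckets = [[] for _ in range(n_buckets)]
    for t in relation:
        buckets[hash(t[key_index]) % n_buckets].append(t)
    return buckets


def join_bucket(r_bucket, s_bucket, r_key, s_key):
    """Joining phase for one bucket pair: build a hash table on R, probe it with S."""
    table = defaultdict(list)
    for r in r_bucket:
        table[r[r_key]].append(r)
    return [r + s for s in s_bucket for r in table.get(s[s_key], ())]


def hash_join(r, s, r_key, s_key, n_buckets=4):
    r_parts = partition(r, r_key, n_buckets)
    s_parts = partition(s, s_key, n_buckets)
    result = []
    # Matching tuples always land in the same bucket number, so each bucket
    # pair is independent and could be handed to a different processor.
    for r_bucket, s_bucket in zip(r_parts, s_parts):
        result.extend(join_bucket(r_bucket, s_bucket, r_key, s_key))
    return result


# Illustrative data only.
r_rel = [(1, "a"), (2, "b"), (3, "c"), (2, "d")]
s_rel = [(2, "x"), (3, "y"), (4, "z"), (2, "w")]
joined = hash_join(r_rel, s_rel, 0, 0)
```

Because no bucket pair depends on another, there is no step comparable to the final merge of a sort-merge join that must be executed serially.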

Fault  tolerance 

•  The  principal  factors  affecting  D/KBMS  availability  are:  system  architecture,  fault 
tolerance  techniques,  database  size,  component  reliability  and  capacity,  and  data 
storage  and  access  method. 

•  Better availability is obtained when the system has many small clusters. In most
cases, N_D < 8 when availability peaks.

•  The effect of fault tolerance techniques on system availability is very significant.
The reliability of a disk (MTTF_disk) is usually a bottleneck, and the performance gain
due to disk mirroring is very substantial. Processor redundancy significantly helps
if disks are mirrored (or other disk redundancy methods are used). However, N_P
= 2 is sufficient for fault tolerance. A higher N_P does not improve availability
significantly. Fault tolerance techniques for other components are useful in
conjunction with fault tolerance techniques for disks and processors.

•  The size of the database has a significant effect on system availability. If the
database is larger, more data will be lost when a hard failure occurs, so recovery
takes longer. This degrades availability. Also, as the database size increases, the
system will require more components of given capacity for storage and efficient
processing. This can have a very significant effect on system availability.
•  Higher MTTF improves availability significantly. System availability will also
improve significantly if disks with higher capacities are used (provided the MTTF of
a high capacity disk is not much lower than that of a low capacity disk). A system
with a higher capacity interconnect has significantly better fault tolerance
when the system consists of a few large clusters.
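
The qualitative effect of MTTF and of disk mirroring can be illustrated with the textbook steady-state model, in which a repairable component is available for the fraction MTTF / (MTTF + MTTR) of the time. This is a deliberately simple sketch, not the availability model of chapter 7: it assumes independent failures, ignores the dependence of recovery time on database size, and uses made-up numbers.

```python
def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability of one repairable component (textbook model)."""
    return mttf / (mttf + mttr)


def mirrored(a: float) -> float:
    """Availability of a mirrored pair, assuming the two copies fail independently."""
    return 1.0 - (1.0 - a) ** 2


# Illustrative numbers only (hours); they are not taken from this study.
single_disk = availability(mttf=30_000.0, mttr=24.0)
mirrored_disk = mirrored(single_disk)
```

Under this model the unavailability of a mirrored pair is the square of the unavailability of a single disk, which is consistent with the substantial benefit of mirroring reported above.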

12.2.  Phase  II  Conclusions

A  high  performance,  highly  available  D/KBMS  for  very  large  D/KB  environments 
can  be  specified  using  the  following  methodology.  First  design  a  high  performance 
parallel  relational  database  machine  that  employs  the  parallel  and  pipelined  join  algo¬ 
rithms  we  developed  under  the  VLPDF  contract.  Next,  implement  the  hardware  and 
software  fault  tolerance  mechanisms  described  in  chapter  7.  Then,  design  parallel  algo¬ 
rithms  for  LFP  evaluation  using  the  above  join  strategies,  data  flow  and  pipelining  tech¬ 
niques,  and  semi-naive  evaluation.  Finally,  design  a  Knowledge  Manager  that  compiles 
Horn  clause  queries  to  relational  algebra  augmented  with  a  general  LFP  operator  and 
that  uses  the  generalized  magic  sets  strategy  for  restricting  the  search  space  to  the 
relevant  base  relation  tuples. 

12.3.  Phase  III  Conclusions 

This  section  presents  the  conclusions  from  the  test  and  experimentation  work  in 
Phase  III. 

•  In  order  that  the  D/KBMS  be  scalable  to  handle  large  rule  sets,  it  is  important 
that  the  rule  storage  structures  be  such  that  the  time  to  extract  the  relevant  rules 
is  independent  of  the  total  number  of  rules  in  the  Stored  D/KB.  Otherwise,  the 
D/KB  query  compilation  times  will  grow  with  the  size  of  the  rule  base.  Our  exper¬ 
imentation  has  shown  that  storing  the  transitive  closure  of  the  PCG  of  the  rules 
and  placing  an  index  on  the  columns  of  this  storage  structure  achieves  the  effect  of 
making  the  relevant  rules  extraction  time  independent  of  the  rule  base  size. 

•  There  are  two  important  tradeoffs  that  relate  to  rule  storage  structures.  The  first 
is  a  time-vs-space  tradeoff.  Compiled  form  rule  storage  structures  like  the  transi¬ 
tive  closure  of  the  PCG  use  more  space  but  permit  faster  query  compilation  than 
non-compiled  storage  structures.  The  other  tradeoff  is  between  query  compilation 
time  and  update  time.  Compiled  form  storage  structures  take  longer  to  update, 
sometimes  even  an  order  of  magnitude  longer  as  some  of  our  experiments  indicated, 
than  non-compiled  storage  structures.  The  choice  of  rule  storage  structure  must  be 
dictated  by  the  relative  cost  of  storage  versus  compilation  and  by  the  application 
characteristics  —  whether  it  is  query  intensive  or  update  intensive. 

•  The  PCG  itself  (as  opposed  to  its  transitive  closure)  is  not  a  good  choice  for  query 
intensive  applications.  This  is  because  during  query  compilation,  the  transitive  clo¬ 
sure  of  the  PCG  will  have  to  be  computed  to  extract  the  relevant  rules  and  this 
can  get  very  time  consuming  for  rules  with  large  PCGs. 

•  The time to extract the relevant rules is very sensitive to the number of rules
extracted. A key to avoiding excessive compilation times is to structure the rules
in such a way that the number of relevant rules for a query is small. Object-oriented
database techniques can prove to be very useful here.

•  Precompilation  of  D/KB  queries  can  prove  to  be  very  useful.  This  is  especially  true
for  frequently  occurring  queries  with  large  Rjr  values.  The  price  of  precompilation 
is  that,  for  precompiled  queries,  information  about  referenced  relations  and  rules 
must  be  recorded.  During  updates,  this  information  is  checked  to  see  whether  the 
update  invalidates  any  compiled  query.  However,  for  applications  involving  few 
updates  and  frequently  occurring  queries  with  large  Rjr  values,  this  price  is  well 
worth  paying. 

•  Two of the main parameters affecting D/KB query execution time are the ratio of
relevant facts to total number of facts and the amount of redundant work done in the
while loop of LFP evaluation. To reduce the amount of redundant work and to restrict
the LFP evaluation to the relevant database tuples, the D/KBMS architecture can use
semi-naive LFP evaluation and the generalized magic sets optimization strategy.

•  There is a tradeoff in using optimization: while optimization restricts LFP
evaluation to the relevant tuples of the database, work must be done to first determine
these tuples. Optimization pays best when the selectivity of the query is low, i.e.,
for queries that retrieve only a small fraction of the database. The benefit of
optimization is particularly telling for queries with very low selectivity and very
large base relations.

•  Relational  algebra  alone  is  not  a  good  choice  for  the  DBMS  interface,  since  the  LFP
evaluation  in  this  case  has  to  be  done  via  an  application  program,  which  introduces 
several  inefficiencies. 

•  The  above  inefficiencies  cannot  be  overcome  using  parallelism  alone.  While  a
parallel  relational  database  machine  can  certainly  speed  up  table  copying  and  ter¬ 
mination  checking,  it  does  not  significantly  reduce  the  percentage  contribution  of 
these  operations  to  the  while  loop  execution  time. 

•  To  achieve  high  performance  in  D/KB  query  execution,  it  is  very  important  that
the  relational  algebra  interface  be  augmented  with  a  generalized  LFP  operator. 

•  In  addition  to  a  general  LFP  operator,  the  DBMS  interface  should  include  commonly
occurring  special  LFP  operators,  such  as  transitive  closure.


12.4.  Future  Directions 


This  section  presents  several  directions  for  future  work  based  on  the  lessons  learned 
from  the  VLPDF  program. 

•  Since the system configuration, that is, the number of clusters, the number of
processors, the number of disks, and the size of memory used in a join operation,
affects the performance along with the relation sizes and selectivities, query
optimization in this multiprocessor environment could be more complicated, and also
more important. It would be useful to investigate more thoroughly the relative
behavior of different algorithms with regard to these parameters and to derive some
heuristics that can be used in query processing.

•  In our work on transitive closure algorithms, we have not assumed any auxiliary
storage structures such as clustered or non-clustered indices and join indices. All
operations are applied to the original data. Join indices have been shown to
improve the performance of join operations. They also improve the performance of
iterative transitive closure algorithms. Further investigation of the relative
performance improvement of Warren’s algorithm resulting from the use of auxiliary
data structures is a worthwhile task.

•  Depth-first and breadth-first algorithms have been explored extensively to solve
general search and tree traversal problems. Since transitive closure computation is
basically a graph search problem, both depth-first and breadth-first algorithms can
be employed to compute the transitive closure. Warren’s algorithm can be viewed
as a depth-first algorithm and iterative algorithms can be viewed as breadth-first
algorithms. This analogy can be useful for further research into the application of
combined breadth-first and depth-first transitive closure computation techniques,
as has been suggested in solving other graph search problems. One possible technique
is to apply an iterative algorithm for a few iterations first to find most of the
tuples in the transitive closure and then switch to Warren’s algorithm to find the
few tuples that can be derived only through longer search paths.

•  In  general,  complete  transitive  closure  is  seldom  required  by  applications;  a  subset
of  the  transitive  closure  is  adequate  for  answering  many  queries.  The  algorithm  to 
handle  the  restricted  transitive  closure  queries  is  dependent  on  the  restriction  cri¬ 
teria.  However,  general  mechanisms  for  restricting  the  output  of  each  iteration  of 
transitive  closure  operation  and  terminating  the  transitive  closure  computation 
after  a  specified  number  of  iterations  are  possible.  These  mechanisms  might  also 
be  useful  in  executing  general  least  fixpoint  queries. 

•  Transitive closure is a data-intensive operation. It is possible to partition this
very large database processing task onto multiple processors and improve the
performance of transitive closure computation significantly. For iterative algorithms
in a multiprocessor environment, the join and union operations in each iteration can
be assigned to a separate processor, improving the performance through concurrent
and pipelined processing. For executing Warren’s algorithm using multiple processors,
the search of subgraphs starting from different nodes in the graph can be
assigned to different processor(s). Another potential area for future work is to
design, analyze, and evaluate multiprocessor based iterative algorithms and
Warren’s algorithm.

•  The  HYBRIDTC  algorithm  described  in  chapter  5  has  excellent  potential  for  parallel
transitive  closure  evaluation,  but  more  work  needs  to  be  done.  We  give  some
suggestions  below.  Each  processor  or  node  can  work  on  one  or  more  pairs  of  buck¬ 
ets.  The  tuples  generated  at  one  processor  are  either  processed  locally  or  sent  to 
other  processors.  The  only  synchronization  needed  is  the  final  termination  of  the 
whole  computation. 


Further optimization of this algorithm is worthwhile. One possibility is as follows:
the new tuples generated are not only hashed on the second attribute and inserted
into the corresponding buckets, but also hashed on the first attribute and inserted
into the second relation in the join (R0). Thus, more tuples can be generated in
each iteration, and a performance improvement can be expected. However, this is
somewhat difficult to implement in a real system, since the size of R0 will change
during processing. Sophisticated memory management strategies and bucket
overflow techniques would have to be developed.

•  Query  processing  and  optimization  techniques  and  the  system  fault  tolerance  are
very  intimately  related.  Query  processing  techniques  and  data  storage  schemes 
depend  on  the  type  and  frequency  of  queries  asked  (i.e.,  the  application).  As  we 
saw  in  our  study,  the  data  storage  scheme  affects  fault  tolerance  significantly. 
Thus,  in  designing  a  real  system,  the  designers  of  query  processing  software  and 
fault  tolerance  have  to  understand  the  application  and  devise  a  storage  scheme  that 
is  acceptable  from  the  query  processing  as  well  as  fault  tolerance  viewpoint. 
Required  levels  of  response  time,  fault  tolerance  and  reliability  are  application 
dependent. 

•  Define and evaluate additional fault tolerance and performance parameters. The
availability measure evaluated in this report is the strong availability defined in
section 7.4.2. Evaluating weak availability may give further insight into system
behavior. We also think that the availability measure gives only a part of the
story. Other parts of system behavior are captured by response time measures. To
evaluate the system more thoroughly, additional parameters that combine no-failure
response time and availability could be evaluated. Examples of such measures
are average response time with failures and average system throughput
[Shet87]. Evaluating these parameters may be difficult, but it is required in light
of the previous issue.

•  Soft  failures  are  tolerated  using  hardware  methods  such  as  processor  pairing,  and
software  methods  such  as  transaction  management  and  software  reinitialization. 
Although  conceptually  these  techniques  are  not  difficult  to  understand,  their  imple¬ 
mentation  may  be  significantly  difficult.  We  feel  these  techniques  and  their  effect 
on  overall  system  performance  should  be  studied  in  more  detail. 

•  We studied a data storage scheme that keeps two copies of data. If more copies of
data are kept, more failures can be tolerated in general. This results in higher
no-failure response time for updates but lower no-failure response time for queries.
However, more data will be lost when a failure occurs. This will increase the
recovery time and hence reduce the availability. A more detailed quantitative
study needs to be performed to find the optimal degree of replication.

•  Fault  tolerance  is  achieved  by  redundancy  in  hardware  and  software.  Redundancy
means  additional  resources  and  overhead.  In  the  future,  cost  associated  with  these 
additional  resources  should  be  quantified  with  respect  to  the  benefits  of  better  fault 
tolerance.  The  types  of  fault  tolerance  techniques  and  the  amount  of  hardware  and 
data  redundancy  required  should  be  guided  by  application  needs. 

•  We  found  that  the  time  taken  to  extract  the  relevant  rules  can  be  made  independent
of  the  total  number  of  rules,  but  it  still  remains  very  sensitive  to  the
number  of  rules  extracted.  Object-oriented  database  techniques  can  prove  to  be 
very  useful  in  reducing  the  time  taken  to  extract  the  relevant  rules  during  D/KB 
query  compilation  since  they  provide  efficient  structuring  of  the  D/KB.  For  exam¬ 
ple,  a  small  set  of  rules  can  be  encapsulated  within  an  object  and  these  rules  can  be 
retrieved  whenever  the  object  receives  a  message  representing  a  query  against 
them.  Encapsulating  rules  within  an  object  is  a  way  of  structuring  the  rules  so 
that  only  the  relevant  portions  of  the  rule  base  are  processed  during  compilation. 
Of  course,  much  work  needs  to  be  done  to  integrate  the  concepts  of  object-oriented 
database  systems  —  inheritance,  message  processing,  persistent  objects,  etc.  —  with 
those  of  D/KB  query  and  update  processing. 

•  We  found  that  D/KB  query  compilation  can  constitute  a  significant  portion  of
D/KB  query  processing  time.  Techniques  for  handling  precompiled  D/KB  queries 
need  to  be  integrated  into  the  D/KBMS  architecture  specification  methodology. 

•  Efficient implementation techniques for generalized LFP operators should be
investigated. We mention several optimization possibilities that open up if the DBMS
interface includes an LFP operator, none of which is possible if the interface
is just relational algebra:

a.  Table  copying  can  be  avoided  by  manipulating  buffer  pointers. 

b.  The  full  set  difference  operation  during  termination  checking  can  be  avoided. 
This  is  because  as  soon  as  a  tuple  is  found  that  was  not  computed  in  the  pre¬ 
vious  iteration,  the  termination  check  can  stop. 

c.  A  dynamically  adaptable  indexing  strategy  can  be  designed  to  speed  up  the 
evaluation  of  the  right  hand  side  of  the  recursive  equations  or  their 
differential.  This  strategy  would  dynamically  create  and  drop  temporary 
indexes  on  the  base  and  intermediate  derived  relations  depending  on  their 
relative  sizes. 

d.  The  join  strategy  can  be  dynamically  changed  between  iterations  if  necessary, 
depending  on  the  sizes  of  the  base  and  intermediate  derived  relations  and  the
join  selectivities  from  the  previous  iterations. 

•  Parallel  algorithms  for  general  LFP  evaluation  can  significantly  improve  D/KB
query  execution  performance.  We  list  several  strategies  below: 

a.  During  each  iteration,  the  right  hand  side  of  each  recursive  equation  may  be 
evaluated  in  parallel. 

b.  Pipelining  and  data  flow  techniques  may  be  used  to  evaluate  the  relational 
algebra  tree  corresponding  to  the  right  hand  side  of  these  equations. 

c.  Parallel  join  algorithms  may  be  employed  during  this  evaluation. 
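
Strategy (a) above can be sketched as follows. The example is only structural: it assumes monotone equations over in-memory sets, uses a Python thread pool as a stand-in for separate processors (threads in CPython do not give a real speedup for CPU-bound work), and all names (`parallel_lfp`, `compose`, `odd`, `even`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def compose(left, right):
    """Return {(a, d) | (a, b) in left and (b, d) in right}."""
    return {(a, d) for (a, b) in left for (c, d) in right if b == c}


def parallel_lfp(equations, max_workers=4, max_iterations=10_000):
    """Least fixed point of monotone equations, one task per right hand side."""
    current = {name: frozenset() for name in equations}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(max_iterations):
            # All right hand sides read the same snapshot of the previous
            # iteration, so they can be evaluated independently of each other.
            futures = {name: pool.submit(rhs, current) for name, rhs in equations.items()}
            new = {name: frozenset(f.result()) for name, f in futures.items()}
            if new == current:
                return new
            current = new
    raise RuntimeError("no fixed point reached within max_iterations")


# Hypothetical mutually recursive equations: paths of odd and of even length.
edge = frozenset({(1, 2), (2, 3), (3, 4)})
equations = {
    "odd": lambda rel: edge | compose(rel["even"], edge),
    "even": lambda rel: compose(rel["odd"], edge),
}
result = parallel_lfp(equations)
```

The only synchronization point is the end of each iteration, where the results of all equations are collected before the termination check.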